FOREWORD


The Twelfth Image Understanding Workshop, sponsored by the Defense Advanced Research
Projects Agency (DARPA), Information Processing Techniques Office, was held in Washington, D.C. on
April 23, 1981. The Workshop was conducted in conjunction with the Technical Symposium East '81
organized by the Society of Photo-Optical Instrumentation Engineers (SPIE).

Lt. Col. Larry E. Druffel, the DARPA program manager for Image Understanding, acted
as the Workshop Chairman. In his opening remarks, Lt. Col. Druffel noted that he expected an
exciting program for this year's workshop. He advised the large audience that he was enthused by the
level of technical progress made in the past year by the various research organizations in the
DARPA sponsored program. Also, Druffel indicated that he was pleased with the association with
SPIE as this provided an opportunity for interchange of ideas between cooperating groups which
could only lead toward improvements in the ongoing research programs. The SPIE special sessions on
Techniques and Applications of Image Understanding held on Tuesday and Wednesday, April 21-22,
commented Druffel, are indicative of the growing interest in Image Understanding and the maturing
nature of I.U. science. We believe this to be an excellent opportunity, he noted, for researchers
to perceive the potential uses for I.U., and for a larger audience to become aware of the current
state-of-the-art.

During the workshop, fifteen technical papers were presented by members of the
university and industrial organizations involved in the DARPA sponsored I.U. program. These papers,
and others of current interest for which time was not available for presentation at the workshop,
are contained in Section I of these proceedings. In order to reach a wider audience of interested
users and research personnel, many of these papers have been provided to SPIE for inclusion in their
symposium proceedings as well as being printed in this DARPA workshop issue. Section II of this
volume contains brief reviews prepared by the principal investigators involved in the DARPA sponsored
research which, although not presented due to the press of time, are meant to keep the various members
of the group, as well as those government personnel who have been interested in the I.U. program,
abreast of the thrust of the research efforts being undertaken at each location.

Readers are reminded that extra copies of these proceedings as well as copies of
previous proceedings may be secured from the Defense Technical Information Center, Cameron Station,
Alexandria, Virginia 22314. Accession numbers for past editions are as shown on the following page.

The materials for the cover of this document were provided by Mr. Bruce Opitz of the
Advanced Technology Division, Headquarters Defense Mapping Agency. The accompanying description reads:

Imagery is the primary source used by the Defense Mapping Agency (DMA),

Many different kinds of information may be extracted depending on which
products are required over each area. Currently, this photo-interpretation
function is almost totally manual. As the extraction process becomes
increasingly automated, DMA can begin to extract all possible needed
information over an area, whether it is currently required or not, and
store it in a universal data base. Any products tailored to each individual
user's needs could then be synthesized from this data base.

Mr. Tom Dickerson, Science Applications, Inc. did the layout for the cover design.
Appreciation is also due to Miss Ann Kastris of Science Applications, Inc. for assistance in putting
together these proceedings and for handling the invitations and mailings necessary to the conduct of the
workshop. Her valuable assistance in registration and the myriad of other details was also a sine qua non
for this undertaking. Finally, our thanks to the Society of Photo-Optical Instrumentation Engineers for
their cooperation and assistance during the planning and execution of this combined endeavor. We hope
that they are in agreement with us that it has been a valuable experience.
Lee S. Baumann

Science Applications, Inc.
Workshop Organizer
REASONING ABOUT IMAGES:
APPLICATION TO AERIAL IMAGE UNDERSTANDING


Peter G. Selfridge
Kenneth R. Sloan, Jr.

Department of Computer Science
University of Rochester
Rochester, New York 14627

Abstract 


Image Understanding (I.U.) shares with Artificial Intelligence the need
for mechanisms of using knowledge to control computation. In I.U., this
knowledge takes the form of prior knowledge about objects and knowledge
gained from doing the computations themselves. This paper presents an
approach to the general I.U. problem and a specific program for locating
buildings in aerial photographs. Prior knowledge is stored as an
Appearance Model, which represents the appearances of possible buildings.
A three stage program, starting with an Appearance Model Expert, generates
operator and parameter sequences to achieve recognition. Some operators
are Adaptive Operators, which hill-climb on a single parameter to optimize
a feature match. Each level uses partial results from below to 1) search
parameters for the best match, 2) infer obscuring image conditions and
deal with them, and 3) infer high-level conditions such as the presence of
occluding objects.


Introduction 

Image Understanding (I.U.) shares many problems with the field of
Artificial Intelligence (A.I.). The use of knowledge to control processing
is one such general problem. In an I.U. domain, this can be formulated as
the problem of using both prior knowledge about the task domain, and
knowledge acquired as processing proceeds, to choose a sequence of
computations which will achieve recognition and location of the desired
objects.


It is a tenet of this research that a program to do this must be prepared
to use partial results, such as partial matches, to change the sequence of
operators and their parameters. In doing so, it must be able to evaluate
the results of its processing and be prepared to infer, at several levels,
what image conditions are impeding recognition of the desired objects.

This paper describes a program to locate buildings in aerial photographs
using the above approach. The program has three parts, shown in Figure 1,
each of which embodies different levels of knowledge and inference ability.

An Appearance Model Expert chooses subgraphs of the building Appearance
Model. A Region Description is derived from this sub-model and passed to
the Operator and Image Problem Expert. This Expert chooses operators to
compute candidate regions and calls other operators to try to match the
region characteristics with the Region Description. One kind of operator
is an Adaptive Operator, which can vary a single parameter to optimize the
match with a specific value or range of a feature. The Operator and Image
Problem Expert can infer the existence of a possible image problem and
take appropriate action. The Appearance Model Expert can infer high-level
conditions and check spatial relations between regions to recognize
building complexes.


We have formalized the I.U. problem in the following way. Prior object
knowledge is stored in an Appearance Model, a graph structure encoding the
expected appearance of the objects. Computational Primitives are routines
that compute on images or derived data structures. The I.U. problem is to
find a correspondence between sub-graphs of the appearance model and
subsets (regions) of the input image, using the available computational
primitives.
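The formalization above can be illustrated with a minimal sketch. It is not the authors' implementation: the class name `SubModel`, the helper names, and all property values are hypothetical, chosen only to show a sub-model node carrying region properties, a Region Description derived from it, and a test of whether a measured region corresponds to that description.

```python
from dataclasses import dataclass, field


@dataclass
class SubModel:
    """Hypothetical sub-model node: a named appearance with region properties."""
    name: str
    # property name -> expected value, or (low, high) for an open range
    properties: dict = field(default_factory=dict)
    parts: list = field(default_factory=list)


def region_description(sub_model):
    """Collect the region property values of one sub-model."""
    return dict(sub_model.properties)


def satisfies(region, description):
    """True when every described property of the region has its expected value or range."""
    for name, expected in description.items():
        actual = region.get(name)
        if isinstance(expected, tuple):
            low, high = expected
            if actual is None or not (low < actual < high):
                return False
        elif actual != expected:
            return False
    return True


# Illustrative values only; they are not taken from the paper's Figure 2.
roof = SubModel("roof", {"intensity": "bright", "shape": "rectangle"})
shadow = SubModel("shadow", {"intensity": "dark", "size": (10, 20)})
building = SubModel("building", parts=[roof, shadow])

candidate = {"intensity": "dark", "size": 14}
print(satisfies(candidate, region_description(shadow)))
```

Keeping the description as a flat dictionary makes it easy to hand to a lower level that knows nothing about the graph it came from.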
Appearance Models and The Appearance Model Expert

An Appearance Model is a data structure encoding the expected appearances
of a class of objects. This model could be derived by a "smart" modeller
from many sources of prior knowledge such as a light model, a camera
model, and a 3D domain model. The "appearance primitive" in our model is a
region. A simplified Appearance Model for buildings is shown in Figure 2,
where sub-model nodes are represented as rectangles, property nodes as
ovals, and property value nodes (place-holders for actual values) as
double ovals.

The Appearance Model Expert uses rule-based knowledge of the semantics of
the Appearance Model, and any partial results achieved so far, to choose
sub-models representing the current goal. For example, some sample rules
in English are:

    If no Candidates then
        pick "Easiest" sub-model

    If Current-Model decomposable then
        locate Sub-Components
        check Interrelations

    If Partial-Success then
        check "close" other models

    If Occlusion-Indicated then
        locate possible Occluding object

These rules result in a subgraph being chosen. From that subgraph, a
Region Description (RD) is created by taking all the region property
values from that subgraph. This RD is a representation of the desired
object in terms of expected region properties. This RD is passed to the
Operator and Image Problem Expert.

Operator and Image Problem Expert

The Operator and Image Problem Expert is given a Region Description from
the Appearance Model Expert, and attempts to locate a region with the
desired properties. To do this, it invokes appropriate operators to
generate candidate regions, and calls Adaptive Operators to try to achieve
good matches on specific properties.

As will be described, this process can also infer possible image problems
from partial matches. Once a problem is inferred, an operator can be
chosen to alleviate the condition, or more information can be computed and
the Expert can try again with the new constraints.

Adaptive Operators

An Adaptive Operator is a program that is given a candidate region and a
desired range of property values for a single property. This represents a
goal to the operator of computing a derived region from the candidate with
its property as close to the desired range as possible. Each Adaptive
Operator has knowledge of the effects of varying a single parameter of a
single operator. An Adaptive Operator uses this knowledge to vary the
parameter, creating derived regions from the candidate, to get a region
that is the best match possible.

For example, Adaptive-Match-On-Size knows that for a dark region created
by thresholding, raising that threshold will shrink the size of the
region, and lowering it will increase the size. It uses that knowledge to
get the region size as close as possible to a given size range.
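The behaviour this paper attributes to Adaptive-Match-On-Size can be sketched as a one-parameter hill-climb. This is an illustration under stated assumptions, not the authors' code: the region is taken to be the set of pixels at or above the threshold (so that raising the threshold shrinks it, as the paper states), the image is a plain list of rows, and the names `region_size`, `miss`, and `adaptive_match_on_size` are hypothetical.

```python
def region_size(image, threshold):
    """Number of pixels at or above the threshold (assumed region definition)."""
    return sum(1 for row in image for value in row if value >= threshold)


def miss(size, low, high):
    """0 when low < size < high, otherwise how far the size lies outside the range."""
    if low < size < high:
        return 0
    return low - size + 1 if size <= low else size - high + 1


def adaptive_match_on_size(image, threshold, low, high, max_steps=256):
    """Hill-climb on the threshold until the region size falls in (low, high).

    Stops early when a step makes the match worse, and returns the best
    threshold found together with the region size it produces.
    """
    best_threshold = threshold
    best_miss = miss(region_size(image, threshold), low, high)
    for _ in range(max_steps):
        if best_miss == 0:
            break
        size = region_size(image, threshold)
        # Too large: raise the threshold to shrink. Too small: lower it to grow.
        threshold += 1 if size >= high else -1
        new_miss = miss(region_size(image, threshold), low, high)
        if new_miss > best_miss:
            break
        best_threshold, best_miss = threshold, new_miss
    return best_threshold, region_size(image, best_threshold)


# Toy one-row image with intensities 100, 105, ..., 195.
toy = [list(range(100, 200, 5))]
print(adaptive_match_on_size(toy, 150, 12, 16))   # (135, 13)
```

Equal-quality steps are accepted so that the search can cross plateaus where several thresholds give the same region size.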


Partial Results and Image Problems

Partial results are dealt with at each level in our system. Adaptive
Operators deal with partial results in the form of a non-optimal match by
attempting to derive regions that are a better match to a given property.

Another kind of partial result dealt with by the Operator and Image
Problem Expert is when a candidate region is missing some essential
feature. For example, if after applying all relevant computations in
attempting to find a rectangular region, the resulting region has only
three good corners, a hypothesis will be made that a Merging condition is
present in the image. Merging is one of a number of domain independent
image conditions that are handled at this level. Once this hypothesis is
made, several alternatives can be considered. Basically, what happens is
that model completion is done, resulting in a model of the merged
building. This more complete model represents more known constraints.
Other operators can now be brought to bear in an attempt to instantiate
the refined model. Figure 3 illustrates a hypothetical example.
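One geometric reading of model completion for the three-corner case is to predict where the missing fourth corner of the rectangle should lie. The sketch below is an assumption made for illustration, not the procedure used by the program described here; it uses exact integer coordinates, and a real system would need a tolerance on the right-angle test. The function name `complete_rectangle` is hypothetical.

```python
def complete_rectangle(corners):
    """Given three corners (x, y) of a rectangle, return the missing fourth.

    The corner at which the other two meet at a right angle is found by a
    zero dot product; the missing corner is then opposite to it.
    """
    for i in range(3):
        b = corners[i]
        a = corners[(i + 1) % 3]
        c = corners[(i + 2) % 3]
        dot = (a[0] - b[0]) * (c[0] - b[0]) + (a[1] - b[1]) * (c[1] - b[1])
        if dot == 0:
            return (a[0] + c[0] - b[0], a[1] + c[1] - b[1])
    raise ValueError("the three corners do not contain a right angle")


print(complete_rectangle([(0, 0), (4, 0), (4, 3)]))   # (0, 3)
```

The predicted corner gives later operators a specific place to look for supporting edges.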

Finally, the Operator and Image Problem Expert may fail to achieve a
perfect match. The Appearance Model Expert now has several alternatives,
depending on the severity of the failure. The Appearance Model Expert may
decide that a region computed from below is an instantiation of a
sub-model other than the one under consideration, and it will
3. Threshold = 150. Size = 14.

4. Threshold = 131. Size = 19.
This candidate is now the right size. Adaptive-Rectangle attempts to see
this region as a rectangle, and it succeeds. This result is passed to the
Appearance Model Expert, which now tries to find the shadow, generating
the following Region Description:

Region  Description: 

Dark 

10  <  Size  <  20 

in  the  rectangle:  20,  5,  35,  20 

The last attribute indicates to look in a rectangle around the located
building. The Operator and Image Problem Expert calls
Adaptive-Match-On-Size and again succeeds. The adjacency condition is
checked, and succeeds as well. The two regions, corresponding to the
located building and shadow, are outlined here:


The example above illustrates two features of our system. First, it shows
an Adaptive Operator at work generating derived regions to satisfy some
criterion (region size). Second, it shows the Appearance Model Expert
locating an object by parts, using a constraint (adjacency of parts) to
limit search for the second part. It does not illustrate a current ability
of the Operator and Image Problem Expert: the detection of random noise
and its application of an operator (local averaging) to alleviate it.
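The shadow search in the example can be sketched as two checks: a size and location test against the Region Description, and an adjacency test between the two parts. The sketch makes several assumptions that the text does not confirm: regions are sets of (x, y) pixels, the rectangle "20, 5, 35, 20" is read as (x0, y0, x1, y1) with inclusive bounds, and adjacency means 4-connected contact. All function names and the toy regions are hypothetical.

```python
def inside(region, rect):
    """True when every pixel of the region lies within rect = (x0, y0, x1, y1)."""
    x0, y0, x1, y1 = rect
    return all(x0 <= x <= x1 and y0 <= y <= y1 for x, y in region)


def matches_description(region, rect, low, high):
    """Size strictly between low and high, and located inside the search rectangle."""
    return low < len(region) < high and inside(region, rect)


def adjacent(region_a, region_b):
    """True when some pixel of region_a is a 4-neighbour of a pixel of region_b."""
    neighbours = {
        (x + dx, y + dy)
        for x, y in region_a
        for dx, dy in ((1, 0), (-1, 0), (0, 1), (0, -1))
    }
    return not neighbours.isdisjoint(region_b)


# Toy regions: a 4x4 building and a 3x4 shadow touching its right edge.
building = {(x, y) for x in range(24, 28) for y in range(8, 12)}
shadow = {(x, y) for x in range(28, 31) for y in range(8, 12)}

print(matches_description(shadow, (20, 5, 35, 20), 10, 20))
print(adjacent(building, shadow))
```

Restricting the search to a rectangle around the located building is what keeps the second part cheap to find.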

Other workers in this domain have written programs embodying some of the
approaches of our program, and have achieved some good results [1, 2, 3, 4].
The analog of our Appearance Model is usually encoded as a fixed set of
region properties, rather than a graph of different alternatives amenable
to intelligent interpretation [1, 2]. Information from located objects is
used in [1, 2] to refine a property table, which can result in a new
segmentation. However, little use of partial matches is made, and none
with the flexibility of our system.

The most novel features of our approach are the Adaptive Operators. While
the hill-climbing techniques they embody are classic, indeed, among the
oldest in A.I. [5], their application to an I.U. domain is new, as is
their use in a framework of our kind. It is important to recognize that
adaptive techniques, as all other computation, must be used in conjunction
with an appropriate control structure. In our case, Adaptive Operators are
applicable only if the candidate region is already "close".

The framework of our program provides a flexible rule-based system that
applies operators when needed and relies on the Adaptive Operators to fine
tune candidate regions. Each level in our program can evaluate and use the
results from the level below, and deal with partial successes in a manner
appropriate to that level.


Conclusion 

Our program, as it stands, is still very sparse. It has been run on six
toy images like the one presented here. More interesting results await
incorporation of more Computational Primitives and rules to guide them.
Further details will be presented in [6].
Figure 1: A Building Location Program

Figure 2: A Partial Appearance Model for Buildings in Aerial Photographs

Figure 3: Model Completion by the Image Problem Expert
1. Abstract

This paper is concerned with the use of a database to support
automated photo interpretation. The function of the database is to
provide an environment in which to perform photo interpretation
utilizing software tools, and represent domain knowledge about
the scenes being interpreted. Within the framework of the
database, image interpretation systems use knowledge stored as
map, terrain, or scene descriptions to provide structural or spatial
constraints to guide human and machine processing. We
describe one such system under development, MAPS (Map
Assisted Photo-interpretation System), and give some general
rationales for its design and implementation.

2.  Introduction 

One characteristic of contemporary research in natural scene
analysis is that knowledge about scene content is often highly
integrated into the recognition program.* The knowledge which
has been most often implemented is spectral constraints; i.e.,
characteristic signatures of water, vegetation, manmade objects
etc., when detected by sensors having known signal response
characteristics. Some examples of such sensors are multi-channel
infrared, radar (FLIR, SLR), and multi-spectral (mss)
(LANDSAT, SEASAT, etc.). In many remote sensing applications,
where the size (grain) of manmade physical objects is much
smaller than the resolving power of the sensor, statistical
techniques are employed to classify relatively large land areas.
These techniques have found application in areas such as land
use management, forestry, and geological mapping. However,
pattern recognition and remote sensing techniques based purely
on spectral analysis cannot handle those classes of imagery
where individual feature detail is complex, and where shadows,
occlusions and other 3D scene domain effects predominate.


*This research was sponsored by the Defense Advanced Research Projects
Agency (DOD), ARPA Order No. 3597, and monitored by the Air Force Avionics
Laboratory under Contract F33615-78-C-1551. The views and conclusions in this
document are those of the author and should not be interpreted as representing
the official policies, either expressed or implied, of the Defense Advanced
Research Projects Agency or the U.S. Government.


In the work on MAPS discussed in this paper, we are interested
in applying spatial and structural constraints to the interpretation
of high resolution aerial mapping and satellite photographs.
Briefly, these constraints can be used to determine "where to
look" and "what to look for". These constraints are represented
as a map database. The map database itself is incrementally
generated through interaction with the system, from human and
machine segmentations of aerial imagery, and from collateral
data. Images are registered to the existing map through an
interactive correspondence procedure, in which a human
operator specifies image-to-map correspondence guided by a
landmark database. Once an initial correspondence has been
obtained, it is possible to apply map domain knowledge to refine
the correspondence and guide further image processing. It is this
iterative procedure, using map knowledge to guide processing
and assimilating results back into the map database, that is the
core of future photo interpretation systems. Further, as we begin
to demonstrate the competency of automatic techniques for
feature extraction, registration, and identification, these systems
can gradually replace their interactive counterparts.

In the following sections we will examine several tasks in which
spatial and structural constraints can be used to guide both
interactive and automated photo interpretation. The major
components of our system will be presented.

3.  MAPS  System  Goals 

MAPS is a collection of interactive display-oriented expert
programs which represent and utilize map, terrain, and image data
over a large area centered over Washington D.C. For a detailed
discussion of the task domain and data representation and
organization see [5].

There  are  several  major  goals  of  this  research: 

•  Show that map knowledge can be incrementally
compiled from a collection of time ordered aerial
photographs. Such knowledge is composed of
structural and spatial feature descriptions, and a large
scale spatial organization (conceptual map).

•  Demonstrate that map knowledge can be used to aid
in the automatic interpretation of aerial imagery by
providing spatial and structural constraints.

•  Experiment with facilities which support data modeled
at multiple levels of resolution, and which provide
computation in the symbolic domain: conceptual map,
terrain descriptions, and scene descriptions.

•  Provide an integrated database access and display
capability to allow users to view map, terrain and
feature descriptions superimposed on image data.


3.1.  System  Data 

The organization of a database system, which includes large
numbers of complex imagery and collateral data, is a major
research and design problem in its own right. The reader is
referred to [5] for a discussion of primitive data types and
operations in mapping signal domain data into symbolic
representations. The following section tabulates the signal data
represented in MAPS.

•  Image Data

   o  Approximately 40 aerial mapping photographs
      providing temporal and spatial overlap. Scale
      ranges from 1:12000 to 1:36000, digitized at
      100 microns aperture to image format of
      2048x2048 with 8 bits of intensity information
      per pixel.

   o  Color and multi-spectral satellite imagery (6) in
      resolutions of 1:60000 and 1:1000000 for large
      area coverage at lower resolution.

   o  Digitized images from United States Geological
      Survey (USGS) topographic maps and common
      tourist guide maps. These images are currently
      used as a graphical interface to provide
      coverage information and cueing for users.

•  Terrain Data

   o  USGS digital terrain database, elevations over 3
      second square grid in meters above sea level.
      Data is organized into 15 minute quadrants,
      each containing 90,000 points. Access is available
      to any grid point within the area bounded by <North
      38deg, West 76deg> and <North 30deg, West
      78deg>.

•  Map Data

   o  Defense Mapping Agency (DMA) radar
      simulation database. Provides accurate
      positional information for large manmade
      structures and hydrographic features.
      Information as to composition and classification
      of features (bridge, commercial buildings etc.)
      is provided in addition to vector list
      descriptions.
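The terrain figures quoted in the list above are mutually consistent: a 15 minute quadrant spans 900 arc-seconds, so a 3 arc-second grid gives 300 points per side and 300 x 300 = 90,000 points per quadrant. The sketch below works through that arithmetic and shows one hypothetical way to index a grid point. The indexing scheme, the function name `locate`, and the convention of passing west longitude as a positive number of degrees are assumptions, not details of the MAPS implementation.

```python
SPACING_ARCSEC = 3        # grid spacing quoted in the text
QUADRANT_ARCMIN = 15      # quadrant size quoted in the text

QUADRANT_ARCSEC = QUADRANT_ARCMIN * 60                   # 900 arc-seconds
POINTS_PER_SIDE = QUADRANT_ARCSEC // SPACING_ARCSEC      # 300
POINTS_PER_QUADRANT = POINTS_PER_SIDE ** 2               # 90000


def locate(lat_deg, lon_deg):
    """Return ((quad_row, quad_col), (row, col)) for a position in decimal degrees.

    Quadrants and cells are counted from 0 degrees; this origin is an
    illustrative choice only.
    """
    lat_sec = round(lat_deg * 3600)
    lon_sec = round(lon_deg * 3600)
    quadrant = (lat_sec // QUADRANT_ARCSEC, lon_sec // QUADRANT_ARCSEC)
    cell = (
        (lat_sec % QUADRANT_ARCSEC) // SPACING_ARCSEC,
        (lon_sec % QUADRANT_ARCSEC) // SPACING_ARCSEC,
    )
    return quadrant, cell


print(POINTS_PER_QUADRANT)        # 90000
print(locate(38.51, 77.0))        # ((154, 308), (12, 0))
```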


4.  Integrated  Database  Components 

In this section we will highlight some of the major components
of MAPS. Much of the detail has been omitted in the interest of
presenting a broad overview of the major design concepts and
capabilities. Our philosophy in the system design has been to
decentralize the organization of MAPS into separate processes,
with each process having a particular area of expertise and
communicating through well defined data structures or files. In
some cases, where a closer coupling is desired due to frequent
communications, we envision using an interprocess
communication mechanism (IPC) [8]. Each component is capable
of stand-alone execution which facilitates incremental system
development and integration. In addition, MAPS components are
valuable research tools in their own right for other members of our
research group. The current MAPS implementation runs on a
VAX 11/780 running the UNIX operating system with 3 megabytes
of memory and 700 megabytes of disk storage. Graphics display
hardware includes a Grinnell frame buffer display connected
directly to the VAX over a DMA interface with hardware zoom, pan,
video digitizer, and tablet input. A Dunn film recorder provides
35mm slide, SX-70, and Polaroid 8x10 hard copy.

4.1. Intelligent Image Display

Window oriented display

One of the central components of the MAPS system is BROWSE
[6], an interactive raster image display facility. BROWSE is a
window oriented display manager which allocates and
manipulates three entities: frames, windows, and images. A frame
is an allocation of buffer space from our Grinnell frame buffer
memory, which has 32 bit planes dimensioned as 512x512 pixels.
BROWSE maintains the state of special hardware such as
programmable cursors, replicated zoom and pan, and overlay
memory so that it is independently available to each frame.
Multiple frames can be allocated by the user, limited only by
Grinnell memory. The most frequent modes of operation are
allocation of four monochromatic frames of 8 bit planes per pixel
each, or a single RGB frame with 8 bits per primary color and a
monochromatic frame. By toggling between frames one can
quickly show stereo pairs, time ordered sequences or illustrate
map to image correspondence.
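The two common operating modes both use exactly the 32 available bit planes: four frames of 8 planes, or one 24-plane RGB frame plus one 8-plane monochromatic frame. A minimal bookkeeping sketch of that constraint follows. The class `FrameBuffer` and its methods are hypothetical and say nothing about how BROWSE or the display hardware actually manages memory.

```python
class FrameBuffer:
    """Tracks how many bit planes of a fixed-depth frame buffer are still free."""

    def __init__(self, total_planes=32):
        self.total_planes = total_planes
        self.frames = {}            # frame name -> number of bit planes

    @property
    def free_planes(self):
        return self.total_planes - sum(self.frames.values())

    def allocate(self, name, planes):
        """Reserve bit planes for a named frame, failing when memory is exhausted."""
        if name in self.frames:
            raise ValueError(f"frame {name!r} already exists")
        if planes > self.free_planes:
            raise MemoryError(f"only {self.free_planes} bit planes free")
        self.frames[name] = planes

    def release(self, name):
        del self.frames[name]


def four_monochrome():
    fb = FrameBuffer()
    for i in range(4):
        fb.allocate(f"mono{i}", 8)
    return fb


def rgb_plus_monochrome():
    fb = FrameBuffer()
    fb.allocate("rgb", 3 * 8)       # 8 bits per primary color
    fb.allocate("mono", 8)
    return fb


print(four_monochrome().free_planes, rgb_plus_monochrome().free_planes)   # 0 0
```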

Window  operations 

Windows are dynamically created within a frame and are a
"window into" a portion of a specified image. Windows are
displayed within frames at positions determined by the interactive
user, or by BROWSE. Operations on windows include: delete,
create, copy, move location, adjust position of image, expand and
shrink size, raise to top, and zoom. A general image windowing
capability is necessary because of the mismatch between our
ability to digitize images (4096x4096 pixels) and generally
available raster display technology (512x512). In addition,
simultaneous display of multiple windows from different images
can be used in interactive stereo, time sequence and change
detection tasks, and image selection from a menu of image
fragments.

Symbolic  naming 

BROWSE allows for the symbolic assignment of names to image
files. A standard set of command files are provided which isolate
the naive user from the actual organization of the image file
system. For example, the command "<dc1420"

    open /vise/washdc/wg1/dc1420/1bw.img 1bw1420 map
    open /vise/washdc/wg1/dc1420/2bw.img 2bw1420 map
    open /vise/washdc/wg1/dc1420/4bw.img 4bw1420 map
    open /vise/washdc/wg1/dc1420/8bw.img 8bw1420 map
    def dc1420 6 mapped
    setmap dc1420 /vise/washdc/wg1/dc1420/gnr.map
    addwin dc1420 low.resolution 8bw1420

"opens" or assigns symbolic names 1bw1420, 2bw1420, etc. to a
hierarchy of resolutions of the generic image dc1420. A frame of
type mapped is then defined by allocating 6 bit planes of Grinnell
memory and is given the name dc1420. Next a file containing a
lookup table mapping function which is specialized to the spectral
characteristics of this image is associated with the frame. Finally,
a window into the image 8bw1420 is defined in frame dc1420 and
given the symbolic name low.resolution. Symbolic names are
consistently used throughout BROWSE to describe frames,
windows and images. The command interpreter allows for unique
substring abbreviations for any symbolic name, and redirection of
input from a command file (as shown above) instead of the user's
terminal.
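Resolving an abbreviation against a table of symbolic names can be sketched as follows, reading the abbreviation rule as "any substring that matches exactly one known name". That reading, the preference for exact matches, and the function name `resolve` are assumptions for illustration; the actual BROWSE command interpreter may behave differently.

```python
def resolve(abbreviation, names):
    """Return the single symbolic name selected by an abbreviation.

    An exact name always wins; otherwise the abbreviation must occur as a
    substring of exactly one name.
    """
    if abbreviation in names:
        return abbreviation
    matches = [name for name in names if abbreviation in name]
    if len(matches) == 1:
        return matches[0]
    if not matches:
        raise LookupError(f"no symbolic name contains {abbreviation!r}")
    raise LookupError(f"{abbreviation!r} is ambiguous: {sorted(matches)}")


names = ["low.resolution", "1bw1420", "2bw1420", "dc1420"]
print(resolve("low", names))    # low.resolution
print(resolve("2bw", names))    # 2bw1420
```

An abbreviation such as "1420" occurs in several of these names and is therefore rejected as ambiguous rather than silently resolved.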

Image resolutions

The relationship among resolutions is often explicitly used by
BROWSE commands, which perform image zooming, image
placement, or prompt for an image coordinate. The general
paradigm is to have the user specify an area of interest in a low
resolution which allows for maximum amount of image context to
be displayed. Once the area of interest is specified, BROWSE
automatically creates a new high-resolution display window. The
user performs operations which require high resolution, such as
feature identification and segmentation, image registration, query
specification, in this window. The ability to rapidly present both
the entire scene at reduced detail and selected portions of the
image at higher resolution is essential to interactive photo
interpretation.
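Moving from an area of interest chosen at low resolution to the matching window at full resolution is a coordinate scaling. The sketch below assumes that each level of the hierarchy is reduced by an integer factor relative to the full image (for instance 8 for a level named 8bw), and that boxes are given as inclusive pixel coordinates (x0, y0, x1, y1). Both conventions and the function name are hypothetical.

```python
def to_full_resolution(box, reduction):
    """Map an inclusive pixel box in a reduced image to full-resolution pixels.

    Each reduced pixel covers a reduction x reduction block of the full image,
    so the far edge is extended to the last pixel of its block.
    """
    x0, y0, x1, y1 = box
    return (
        x0 * reduction,
        y0 * reduction,
        (x1 + 1) * reduction - 1,
        (y1 + 1) * reduction - 1,
    )


# A 64x64 area picked in an 8x reduced image covers 512x512 full-resolution pixels.
print(to_full_resolution((10, 20, 73, 83), 8))   # (80, 160, 591, 671)
```

A 64x64 selection at a reduction of 8 maps to a 512x512 area, which is the display size the text quotes for the frame buffer.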

Photograph 1 illustrates the use of a reduced resolution window
low.resolution to provide context for several full resolution
windows, jefferson.memorial, watergate.hotel, monument,
white.house and lincoln.memorial. This display was created using
the windowing and zooming features of BROWSE. A partial
segmentation associated with this image is displayed in color from
Grinnell overlay memory. The loss of color and lack of contrast in
publication may make it difficult to see the vector outlines.


4.2.  Image  Segmentation 

In MAPS, map knowledge acquisition involves the integration of
image segmentations and collateral map data. Image
segmentations specify a 2D vector description of cultural and
natural features. Features are classified as point, linear, or areal
and are given identifiers which reflect their proper name (kennedy
center) or feature type (runways). Labeling of features is
performed during hand segmentation, or as a post processing
operation on machine generated descriptions. Image
segmentation files can be generated by a combination of the
following three procedures:

Hand  generated 

The user interactively specifies the position and shape of a
feature using a high resolution display and a cursor.
Capabilities include editing descriptions and image
segmentation display in multiple levels of detail.

Map  generated 

Given an image-to-map correspondence we can use existing
image database segmentations to generate a segmentation
for a new image. This first order approximation can be edited
by hand, or processed by machine to yield a composite image
description.

Machine  generated 

Experimental coarse-fine segmentation using region growing
and edge profile analysis has begun to be tested. Briefly, the
technique is to use a coarse hand or map segmentation to
specify the area within which a detailed machine
segmentation should be performed. The user can
incrementally accept, reject or edit descriptions as they are
generated.

Figure 2 is plotted from the image segmentation file associated
with the aerial image in photograph 1. Symbolic names from the
segmentation file have been positioned by hand. The area
portrayed corresponds to the low.resolution window.

4.3.  Landmark  Selection 

In order to perform image-to-map correspondence, we have
assembled a landmark database of approximately 140 landmark
features. Typical features include road intersections and traffic
circles, corners of park areas, bridge access ramps, and building
corners at ground level. Selection criteria included the
following:

Uniqueness 

Easily visible from above, uniquely shaped, easily measured;
e.g. ends or junctions of linear features rather than "center".

Non-temporal

The  landmark  feature  should  not  be  sensitive  to  normal 
seasonal  changes  in  foliage  or  water  levels.  This  ruled  out 
many  interesting  river  and  park  landmarks  having  distinctive 
structure in aerial photography.

Spatial

Landmarks should be spatially spread in order to provide the
capability to perform accurate correspondence over the
entire task area.

2  Dimensional  and  3  Dimensional 

Initially the correspondence between our map and data base
is two dimensional. This implies that features with extreme
elevations, such as the roofs of multi-story buildings, are not
appropriate. Corners of buildings (at ground level) are
currently used. However, we expect that as the system
grows, 3D features will be accumulated and used to compute
the camera model of a given picture.

Each landmark description in the MAPS database consists of
map and image coordinates, a textual description of the landmark,
and an image fragment containing the landmark point. Figure 3
gives a sample landmark description from the database.

The landmarks are manually created and edited using an
interactive display program. The use of the landmark database
component will be discussed in the following section on
image-to-map correspondence.

IDM>  base  file  name  [mcphersq] 
latitude  N38  54  8  600 
longitude  W77  2  2  400

30  7, 1756  in  /v f se/washdc/wgl /dc386l7/lb«r .  img 
landmark  image  at  resolution  1 

mcpherson  square 

A small park square in District of Columbia NW.
Located just north of Veterans Administration
and Lafayette Buildings. Control point is in
upper right corner of the landmark image.

Figure  3:  Landmark  description  for  McPherson  Square 


4.4.  Image-to-Map  Correspondence 

In order to integrate image segmentations, map image data, and
collateral map data such as USGS and DMA feature and terrain
descriptions into a consistent representation we perform an
image-to-map correspondence. The steps in performing the
correspondence between a new image and the map database are
as follows:

Step 1. Initial identification

An initial set of corresponding points is specified by
selecting descriptions from the landmark database.
Landmark selection is performed by explicit naming or menu
selection of landmark image fragments. We plan to extend
this to selection by landmark description, e.g. "bridge over
Potomac", to display image fragments satisfying the
description. Once a landmark is selected, users are
prompted to graphically indicate the corresponding point in
the new image. The selected landmark image fragment is
displayed for reference. If the user specifies correspondence
in a low resolution window, a high-resolution window is
automatically created and the point is respecified with greater
precision. An initial guess of map coverage can be computed
after specification of the first corresponding point using
image scale, digitization size, and an assumed image
orientation. This estimate allows MAPS to generate image
fragments of plausible landmarks from the outset, in lieu of
user expertise or familiarity with the region covered by the
image.

Step  2.  Perform  correspondence 

Our correspondence process [3] is invoked with a file of
image pixel coordinates and corresponding landmark
latitude/longitude pairs. This correspondence pair file is
used by MAPS to produce coefficients for image-to-map
linear, second order and third order polynomial
approximations. Mean error, variance for each point, and a
measure of model error (Predicted Sum of Squares (PSS)) are
computed. The best order model is selected based on these
measures. Correspondence coefficient files are associated
with each image in the database.

Step  3.  Landmark  candidate  generation 

The coefficient file generated in step 2 is used to calculate the
map coordinates of the four image corner points. These
coordinates are used to search the landmark database for
possible new landmarks within the image coverage.

Step 4. Select, Add, Modify, Delete

The image-map correspondence pairs used to generate the
coefficient file in step 2 are redisplayed. Bounding boxes
indicating the original point selected by the user, and the
position calculated by application of map-to-image
coefficients, are superimposed over the image display.
Landmark candidates generated in step 3 are also displayed.
The user can select a particular landmark, zoom the image to
view a high resolution window, and modify or delete
correspondence pairs. Landmark candidates may also be
selected for addition to the correspondence pair file.

Step  5.  Iterate 

The addition, deletion and modification of correspondence
pairs can continue by invoking the correspondence process
(step 2) as desired. Error statistics generated by the
correspondence routines are useful for detecting stopping
conditions.
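
The fitting and model-selection part of step 2 can be sketched as follows. This is a minimal illustration, not the MAPS correspondence process itself: it assumes an ordinary least-squares fit of one output coordinate (for example latitude) as a polynomial in the two image coordinates, and it computes the predicted sum of squares by leave-one-out refitting. The function names (poly_terms, fit, predict, press) are hypothetical, and the input coordinates are assumed to be scaled to a small range so that the normal equations stay well conditioned.

```python
def poly_terms(x, y, order):
    """All monomials x**i * y**j with i + j <= order."""
    return [x ** i * y ** (d - i) for d in range(order + 1) for i in range(d + 1)]


def solve(a, b):
    """Solve the square system a * z = b by Gaussian elimination with pivoting."""
    n = len(b)
    m = [row[:] + [b[k]] for k, row in enumerate(a)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[piv] = m[piv], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    z = [0.0] * n
    for r in reversed(range(n)):
        z[r] = (m[r][n] - sum(m[r][c] * z[c] for c in range(r + 1, n))) / m[r][r]
    return z


def fit(points, values, order):
    """Least-squares coefficients for one output coordinate (normal equations)."""
    rows = [poly_terms(x, y, order) for x, y in points]
    k = len(rows[0])
    ata = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    atb = [sum(r[i] * v for r, v in zip(rows, values)) for i in range(k)]
    return solve(ata, atb)


def predict(coef, x, y, order):
    """Evaluate the fitted polynomial at image point (x, y)."""
    return sum(c * t for c, t in zip(coef, poly_terms(x, y, order)))


def press(points, values, order):
    """Leave-one-out predicted sum of squares for one output coordinate."""
    total = 0.0
    for i in range(len(points)):
        coef = fit(points[:i] + points[i + 1:], values[:i] + values[i + 1:], order)
        total += (values[i] - predict(coef, *points[i], order)) ** 2
    return total
```

Under these assumptions, one would compute press for orders 1, 2 and 3 (for latitude and longitude separately) and prefer the order with the smallest value. A third order fit has ten coefficients per coordinate, so the leave-one-out step needs at least eleven correspondence pairs.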

Photograph 4 shows a sample correspondence display. The
position of the overlapping rectangles in the image (color
overlays) indicates to the user the magnitude of the local error for
each of the correspondence pairs. The mcpherson square
landmark previously described in figure 3 is zoomed in a window.
Future directions in the area of image-to-map correspondence
include using multiple points in specification of landmark position,
using local 3D scene information, and automating the selection of
landmark and image correspondence pairs.

4.5.  Map  Database 

The map database component of MAPS is central to providing
access to imagery, guiding photo interpretation, and processing
queries about manmade and natural features. Through the
image-to-map correspondence process, map knowledge can be
applied to any image, and the spatial relationships of sets of
imagery can be established. We have described how image
segmentations are used to extract feature descriptions and how
these segmentations can be integrated through the same
correspondence process. Similarly, feature descriptions from
collateral data, such as the CVjIk radar simulation database, are
included in MAPS.

Figure 4: Image to Map Correspondence

We would like to extend the notion of a purely feature oriented
map database to one which has the ability to represent general
spatial knowledge in the scene domain. This conceptual map
provides a framework within which individual map features can be
associated with high-level semantic map descriptions.
Conceptual maps capture the spatial arrangement in urban areas
of neighborhoods, political, and geographical boundaries. For
example, terms such as "Northwest Washington", "Foggy
Bottom", "Alexandria, Virginia" are often used to describe general
areas within and around Washington D.C. They provide an
important mechanism for symbolic access into an image database,
e.g. "display images of Georgetown later than 1976". However,
depicting precise boundaries of conceptual features from aerial
imagery is a difficult problem. In many cases boundaries are
ill-defined and highly dependent on the user's own spatial model.
There is clearly a hierarchy corresponding to levels of detail
among conceptual features which must be preserved in order to
use such knowledge effectively.

The advantages of providing such a representation are
important. First, conceptual features can be used to partition the
map feature space. This partitioning would be based on natural
spatial relationships, ones which are likely to arise in database
queries, rather than artificial cellular or raster decompositions.
Second, many queries into the map database can be resolved at
the symbolic level, without resorting to geometric computations.
This is particularly true if static relationships such as intersection,
inclusion, and adjacency can be precomputed and represented in
the conceptual map. Queries of the form "does x intersect y"
should be handled by looking up the binary relationship intersect
for x, y and all entities which make up x and y. Of course, when
the actual location of the intersection is required, a geometric
operation must be performed. However, if x and y are
hierarchically represented by their conceptual components, a
symbolic query can be used to find the components which
intersect, and perform intersection computations locally.
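
A symbolic lookup of this kind might be organized as in the following sketch. It is only an illustration of the idea: the feature names, the COMPONENTS hierarchy and the INTERSECT table are made up for the example and are not taken from the MAPS database.

```python
# Hypothetical conceptual map: each conceptual feature lists its components.
COMPONENTS = {
    "area.a": ["road.1", "road.2"],
    "area.b": ["bridge.1", "road.3"],
}

# Precomputed, symmetric "intersect" facts between primitive features.
INTERSECT = {
    frozenset(("road.1", "bridge.1")),
    frozenset(("road.2", "road.3")),
}


def primitives(name):
    """Expand a feature into the primitive features that make it up."""
    parts = COMPONENTS.get(name)
    if parts is None:
        return [name]
    return [leaf for part in parts for leaf in primitives(part)]


def intersecting_components(x, y):
    """Pairs of primitive components of x and y recorded as intersecting."""
    return [(a, b) for a in primitives(x) for b in primitives(y)
            if frozenset((a, b)) in INTERSECT]


def does_intersect(x, y):
    """Answer "does x intersect y" without any geometric computation."""
    return bool(intersecting_components(x, y))
```

The pairs returned by intersecting_components are the places where a geometric intersection would still have to be computed if the actual location is required.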

We are currently beginning work on conceptual map
representations for features such as roads, buildings, and bridges.
Concurrently we are exploring the issue of spatial containment
and hope to demonstrate symbolic image access using
conceptual maps within MAPS.

5. Interface to Image Processing Techniques

Given a system with the capabilities of MAPS, we believe the
prospects for applying current image processing technology to
detailed urban city scenes improve markedly. Feature or landmark
extraction can be guided by spatial and structural constraints.
Even errorful map descriptions can allow for the localization of
processing. Given an assumption of consistency in error, one can
compensate by relaxing the predicted position of an object in the
scene. Such pruning of the "search space" can enhance the
usefulness of local image segmentation techniques which focus
processing in the local area with approximately known properties.
Examples of current local techniques include edge profile analysis
[2], local image registration techniques [4], edge linking by
directional analysis [7], and Hough transforms [9].

Structural constraints in the form of shape knowledge can be
used to fill in missing or noisy data. Complex aerial and satellite
imagery contain many examples of building surfaces in shadow or
completely occluded by other buildings, or roads obscured by tree
cover. Besides these inherent 2D image from 3D scene problems,
image formation and illumination conditions often make it difficult
for even humans to identify boundaries between features.
However, when structural constraints such as rectangular shape
or direction of gravity are given, the application of gradient space
surface orientation theories [1] can be made tractable for the
scene under analysis.

6.  Conclusions 

In this paper we have described the major components of the
MAPS system: intelligent image display, image segmentation,
landmark selection, image-to-map correspondence, and the map
database. The use of spatial and structural constraints in the
interpretation task has been our main focus. We are continuing to
incrementally develop our map representation for the Washington,
D.C. area.


ABSTRACT

A method is described capable of decomposing the optical flow
into its rotational and translational components. The
translational component is extracted implicitly by locating the
focus of expansion associated with the translational component
of the relative motion. The method is simple, relying on
minimizing an (error) function of 3 parameters. As such, it can
also be applied, without modification, in the case of noisy input
information. Unlike the previous attempts at interpreting optical
flow to obtain elements, the method uses only relationships
between quantities on the projection plane. No 3D geometry is
involved. Also outlined is a possible use of the method for the
extraction of that part of the optical flow containing information
about relative depth directly from the image intensity values,
without extracting the "retinal" velocity vectors.


INTRODUCTION

The distribution of velocities on the projection surface arising
as a consequence of the relative motion of objects with respect to
the observer, the optical flow (Gibson, 1950, 1955), contains
information not only about the (relative) motion itself, but also
about the three-dimensional disposition of the set of texture
points of which a given set of image elements is a projection
<Note 1>. The distribution of image velocities on the projection
surface is a function of three parameters: the (relative) motion of
objects, their distance to the center of projection, and the local
three-dimensional geometry of the objects. Fortunately, however,
their effects are, conceptually at least, separable (Prazdny,
1981).

The problem of extracting information from optical flow can
conveniently be divided into two stages. First, one has to develop
a method of extracting the image velocities from changes in the
image intensity values on the projection surfaces. Following this,
it remains to solve the problem of computing the required
information about (local) surface orientation, relative depth and
motion from the distribution of these velocities. While the
interpretation of optical flows logically depends on the solution
to the problem of extracting the constituent velocities, the
problem can and should be studied on its own, for its solution not
only explicitly voices the requirements on the quality of the
velocity extraction process, but also determines the ultimate
success of the whole enterprise.

Recently, optical flows (and time varying imagery in general)
have received growing attention among the computer vision
community as a source of possible information about a scene.
Nakayama and Loomis (1974) and Fennema and Thompson (1979)
studied how discontinuities in the "retinal" velocity field could
be used for segmentation purposes. Clocksin (1980), Gibson et al.
(1955), and Lee (1974) studied how the optical flow generated by
an observer translating in a stationary world provides
information about (local) surface orientation. Koenderink and van
Doorn (1976), Longuet-Higgins and Prazdny (1980), and Prazdny
(1980, 1981) studied the extraction of surface orientation,
(relative) depth and motion from optical flow generated by an
arbitrary curvilinear motion.

Another kind of approach, sometimes considered more suitable for
a computer vision system, relies on interpreting a sequence of
static images as discrete snapshots (see e.g., Nagel, 1981;
Aggarwal and Badler, 1980). The computation of "retinal"
velocities from image intensity values was studied by Fennema and
Thompson (1979), Hadani et al. (1980) and Horn and Schunck
(1980). Some approaches based on matching higher-order image
structures obtained from two (temporally) consecutive images were
attempted by, for example, Barnard and Thompson (1980).

In this paper, we outline a method for obtaining the
instantaneous direction of (relative) motion from optical flow
manifested as image motions on the planar projection surface. We
use polar projection as the model of the physical image forming
process. Also, we assume throughout the paper that our world
contains only rigid and opaque objects. The method presented here
does not use projective or geometrical relations, as might be
expected from the use of the polar projection; it is based on
computations and relationships defined and measurable solely on
the projection plane. For example, the method does not require
knowing the visual direction (a 3D vector) of a "retinal" point
(as was required, e.g., in Prazdny, 1980).

Before outlining the method, we briefly consider a few relevant
facts. Optical flow can be (instantaneously) decomposed into two
independent components (Koenderink and van Doorn, 1976;
Nakayama and Loomis, 1974), a rotational and a translational
one. However, only the translational "retinal" field consists of
motion along straight lines all intersecting at a common point,
the focus of expansion (FOE). This point corresponds to the point
where the (three dimensional) vector tangent to the motion path
described by O (at a given instant) pierces the projection plane.
Our method, by searching for this point, effectively decomposes
the optical flow field into its two constituent fields. Briefly,
the method is based on minimizing an error function of three
parameters. The construction of the function reflects the
following observation: if the three parameters specifying the
rotational component of the (relative) motion are chosen
properly, the translational "retinal" field yields lines all
meeting at the FOE. We do not require the spatial derivatives of
the "retinal" velocity field, as in Koenderink and van Doorn
(1976) or Longuet-Higgins and Prazdny (1980). The method can, but
does not have to be, implemented as a local computation. While we
have chosen, for simplicity, to consider only the case of an
observer moving in a stationary world, it should be noted that
the method has a much more far-reaching implication. Because it
produces a description of relative motion, it can be applied to a
region of the image locally to describe the (relative) motion of
that region independently.

LOCATING  THE  FOCUS  OF  EXPANSION  (FOE) 

To see that it is possible to decompose the instantaneous
positional velocity field on the projection plane into its two
components, consider the effects of rotation and translation
separately. It is advantageous to imagine that the optical flow
field is generated by the motion of the observer in a stationary
environment. This conceptualization has an immediate
interpretation and is, of course, legitimate, for all motion
considered here is relative.

Consider the observer rotation first. Because the rotational
component of the relative motion does not carry information about
the 3D disposition of the texture elements, the motion of an
image element on the projection plane will depend only on its
position on PP. A rotation vector (angular velocity vector
perpendicular to the instantaneous plane of rotation) can be
decomposed into two components, one parallel to the projection
plane (PP), and one perpendicular to it (see Figure 1) <Note 2>.

Consider the rotation about the vector perpendicular to PP first
(the z-axis in Figure 1). For each "retinal" point P with
coordinates (x,y), the rotation of the observer about the z-axis
(perpendicular to PP) results in P moving along a circular
trajectory on PP. The motion of P on PP is specified by a
direction vector $\hat{\mathbf{c}} = (-y,\,x)/\sqrt{x^2+y^2}$, and
by the magnitude $c = c_0\sqrt{x^2+y^2}$, where $c_0$ is the speed
of a "retinal" point at a unit distance from O' (the center of the
"retinal" coordinate frame). The (2D) velocity of P(x,y) due to
observer rotation about the z-axis is thus given by

(1)  $\mathbf{c}(x,y) = c_0\,(-y,\,x)$

Consider now the situation in which the observer rotates about a
vector parallel to the projection plane <Note 3>. To simplify the
discussion, we consider only rotation about an axis (through O)
parallel to the "retinal" y-axis. The expression for rotation
about an axis parallel to the x-axis is symmetrical in the
coordinates x and y [compare equations (5) and (6)]. We first
show that the path of a point P(x,y) under rotation of the
observer about the y-axis is a hyperbola, and then derive the
expression for the velocity vector $\mathbf{h}_H$ at P(x,y).

Consider Figure 2. A stationary texture element in the 3D
environment projects into a point P(x,y) on PP. As PP rotates
about a line parallel to the y-axis, the coordinates of P will
eventually become P(0,$y_0$). Observe that the projecting ray
defines a fixed visual angle $\varepsilon$ with respect to the
plane of rotation. It is clear from Figure 2 that

$y_0 = \tan\varepsilon = \dfrac{y}{\sqrt{x^2+1}}$

This is because the distance |OO'| = 1, by assumption (this
effectively scales the whole projection system by the focal
distance). From this, we obtain

(2)  $y^2 = y_0^2\,(x^2+1)$

This is the equation of a hyperbola with center at origin O'. The
direction of the velocity vector at P(x,y) is determined by the
tangent line at that point. Differentiating (2), we obtain

$\dfrac{dy}{dx} = \dfrac{xy}{x^2+1} = \tan\xi$

(see Figure 3). $\hat{\mathbf{h}}_H$ is thus determined by
$\hat{\mathbf{h}}_H = (\cos\xi,\,\sin\xi)$. In terms of the
("retinal") coordinates (x,y) of P, this becomes

(3)  $\hat{\mathbf{h}}_H = \dfrac{1}{\left((x^2+1)^2+(xy)^2\right)^{1/2}}\,(x^2+1,\; xy)$

To determine the magnitude of $\mathbf{h}_H$, consider two fixed
points R and S on two rays such that at time $t_0$, the points
coincide with points P(0,$y_0$) and O', respectively (see Figure
3). The two rays define a visual angle $\varepsilon$. Now at a
time $t_1$ (after a rotation of PP by some angle), R projects into
the point (x,y) while S projects into the point (x,0). It is
evident that the two projections move so that at any time, their
x-coordinates are the same. In other words, the x-components of
their velocities on PP are the same. We know that the path of P
is a hyperbola. It is thus sufficient to compute the horizontal
velocity component $h_x$ and project it back onto
$\hat{\mathbf{h}}_H$ to obtain $|\mathbf{h}_H|$, the magnitude of
$\mathbf{h}_H$.

Consider Figure 4. If the point x moves with angular velocity
$h_{0H}$ (recall that |OO'| = 1), then $h_x$ is defined by

(4)  $h_x = \dfrac{h_{0H}\sqrt{x^2+1}}{\cos\eta}$

But $\cos^2\eta = 1/(x^2+1)$ (see Figure 4) so that
$h_x = h_{0H}\,(x^2+1)$. Projecting $h_x$ back on
$\hat{\mathbf{h}}_H$ and combining the result with equation (3)
we obtain

(5)  $\mathbf{h}_H = h_{0H}\,(x^2+1,\; xy)$

Here $h_{0H}$ is the speed of a "retinal" element at O'.
Analogously, the rotation of the observer about an axis (through
O) parallel to the "retinal" x-axis results in the velocity
vector $\mathbf{h}_V$ defined by

(6)  $\mathbf{h}_V = h_{0V}\,(xy,\; y^2+1)$
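
Equations (5) and (6) can be checked numerically with a short sketch. It assumes a projection plane at unit distance (x = X/Z, y = Y/Z) and a particular sign convention for the rotation of the scene relative to the observer; both are assumptions made for this illustration, as are the function names. The check compares a finite-difference image velocity with the closed-form expressions.

```python
import math


def project(X, Y, Z):
    """Polar (perspective) projection onto the plane at unit distance."""
    return X / Z, Y / Z


def rotate_about_y(X, Y, Z, theta):
    """Rotate a scene point about the y-axis (assumed sign convention)."""
    return (X * math.cos(theta) + Z * math.sin(theta), Y,
            -X * math.sin(theta) + Z * math.cos(theta))


def rotate_about_x(X, Y, Z, theta):
    """Rotate a scene point about the x-axis (assumed sign convention)."""
    return (X, Y * math.cos(theta) + Z * math.sin(theta),
            -Y * math.sin(theta) + Z * math.cos(theta))


def h_H(x, y, h0H):
    """Equation (5): image velocity due to rotation about the y-axis."""
    return h0H * (x * x + 1.0), h0H * x * y


def h_V(x, y, h0V):
    """Equation (6): image velocity due to rotation about the x-axis."""
    return h0V * x * y, h0V * (y * y + 1.0)


def finite_difference(rotate, point, rate, dt=1e-6):
    """Approximate image velocity of `point` rotating at angular speed `rate`."""
    x0, y0 = project(*point)
    x1, y1 = project(*rotate(*point, rate * dt))
    return (x1 - x0) / dt, (y1 - y0) / dt
```

For a scene point such as (0.4, -0.7, 2.5), finite_difference(rotate_about_y, ...) should agree with h_H at the projected position to within the finite-difference error, and the quantity y^2/(x^2+1) should stay constant along the path, which is the hyperbola of equation (2).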

The input data we are trying to interpret, the optical flows,
consist of a set of vectors v" defining the positional velocity
field on the projection plane. Because we are dealing with
velocities it is easy to see that

(7)  $\mathbf{v}'' = \mathbf{c} + (\mathbf{h}_V + \mathbf{h}_H) + \mathbf{t}$

where $\mathbf{t}$ is the velocity vector due to the pure
translation of the observer. In other words, for each "retinal"
locus (x,y), and a set of parameters $(h_{0H}, c_0, h_{0V})$,
equation (7) defines a vector $\mathbf{t}$ which is a vector
function of the three parameters.
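
Solving equation (7) for the translational vector is a simple subtraction, as the following sketch shows. The function names are hypothetical, and synthetic_flow is only an assumed test model in which the translational vector at P is a depth-dependent multiple of (P - FOE); it is not part of the method itself.

```python
def rotational_flow(x, y, h0H, c0, h0V):
    """Sum of equations (1), (5) and (6) at the retinal point (x, y)."""
    vx = -c0 * y + h0H * (x * x + 1.0) + h0V * x * y
    vy = c0 * x + h0H * x * y + h0V * (y * y + 1.0)
    return vx, vy


def translational_part(x, y, v, params):
    """Solve equation (7) for t, given the flow vector v and (h0H, c0, h0V)."""
    rx, ry = rotational_flow(x, y, *params)
    return v[0] - rx, v[1] - ry


def synthetic_flow(x, y, params, foe, scale):
    """Assumed test model: t = scale * (P - FOE), plus the rotational flow."""
    rx, ry = rotational_flow(x, y, *params)
    return rx + scale * (x - foe[0]), ry + scale * (y - foe[1])


def cross(a, b):
    """z-component of the cross product of two 2D vectors."""
    return a[0] * b[1] - a[1] * b[0]


if __name__ == "__main__":
    params, foe = (0.02, -0.01, 0.03), (0.3, -0.2)
    for x, y, scale in [(0.1, 0.05, 1.0), (-0.2, 0.3, 0.6), (0.4, 0.1, 1.4)]:
        v = synthetic_flow(x, y, params, foe, scale)
        t = translational_part(x, y, v, params)
        # A vanishing cross product means the line through (x, y) along t
        # passes through the FOE.
        print(abs(cross(t, (x - foe[0], y - foe[1]))) < 1e-12)
```

With the correct rotational parameters the recovered vectors all point along lines through the FOE; with incorrect parameters they generally do not, which is the property exploited in the rest of this section.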

As mentioned above, the property of the translational "retinal"
velocity field (defined as straight lines specified by a given
"retinal" locus (x,y) and the associated vector $\mathbf{t}$) is
that all the lines intersect at one common point, the FOE (see
Figure 3). This property makes it possible to define an error
function which will lead to resolution of the vector field v"
into its rotational and translational components. For a given
distribution of v" on PP, we are searching for those values of
the parameters $(h_{0H}, c_0, h_{0V})$ for which the set
$T = \{\mathbf{t}_i\}$ is such that all lines $L_i$ defined by the
vector $\mathbf{t}_i$ and the retinal locations $P_i$ intersect at
a common point, the FOE.

One way of doing this is as follows. Consider an arbitrary
"retinal" point P [with coordinates (x,y)] and a set of other
(possibly neighboring) points $\{P_i\}$. The points $P_i$ together
with the vectors $\mathbf{t}_i$ define lines $L_i$ which intersect
the line L at points $I_i$ (see Figure 6). Consider the lengths
$\ell_i$ between the intersections $I_i$ and the point P. The
variance

$V = \dfrac{1}{n}\sum_i (\ell_i - \bar{\ell})^2$

(where $\bar{\ell}$ is the algebraic average of the $\ell_i$'s) is
a good measure of the dispersion of the intersections $I_i$. When
the lines $L_i$ all meet at the FOE, V = 0. To obtain the FOE, we
thus simply minimize $V(h_{0H}, c_0, h_{0V})$.

Note the way in which the decomposition is accomplished: a
property of the translational field is here used to obtain the
rotational field, resulting in both fields being obtained at the
same time, by the very same computation. The method, being
minimalization of a distribution measure, can also immediately be
applied when the input data (the vectors v") are noisy.
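
The error function described above might be coded as in the following sketch. It is an illustration under stated assumptions rather than the implementation used in the experiments: samples are given as (x, y, (vx, vy)) tuples, the line L is taken to be the line through the central point P along its own translational vector, and the function names are hypothetical.

```python
import math


def rotational_flow(x, y, h0H, c0, h0V):
    """Sum of equations (1), (5) and (6) at the retinal point (x, y)."""
    vx = -c0 * y + h0H * (x * x + 1.0) + h0V * x * y
    vy = c0 * x + h0H * x * y + h0V * (y * y + 1.0)
    return vx, vy


def cross(a, b):
    """z-component of the cross product of two 2D vectors."""
    return a[0] * b[1] - a[1] * b[0]


def variance_V(params, centre, neighbours):
    """Dispersion of the intersections I_i along the line L through `centre`.

    `centre` and each neighbour are (x, y, (vx, vy)) samples of the flow.
    Raises ZeroDivisionError when two lines are parallel (no intersection).
    """
    def t_at(sample):
        x, y, v = sample
        rx, ry = rotational_flow(x, y, *params)
        return v[0] - rx, v[1] - ry

    x, y, _ = centre
    t = t_at(centre)
    speed = math.hypot(t[0], t[1])
    lengths = []
    for sample in neighbours:
        ti = t_at(sample)
        offset = (sample[0] - x, sample[1] - y)
        # Solve P + a*t = P_i + b*t_i for a; a*|t| is the signed length l_i.
        a = cross(offset, ti) / cross(t, ti)
        lengths.append(a * speed)
    mean = sum(lengths) / len(lengths)
    return sum((l - mean) ** 2 for l in lengths) / len(lengths)
```

Any general-purpose minimizer over the three parameters could then be applied to variance_V. The division by zero for parallel lines corresponds to the case in which no intersection, and hence no FOE, is defined.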

SOME  EXPERIMENTAL  RESULTS 

The schema described above was tested in a (simulated) world of
planar surfaces. The results are encouraging. Eight points
surrounding a central point were used to define the set $\{P_i\}$
to obtain the variance V. It should be noted that while in our
implementation neighboring points were used (the neighborhood
subtended about 15 degrees of arc), this is by no means a
necessary condition. A direct minimalization scheme attributed to
Nelder and Mead (Nash, 1979) was used to minimize the variance V.
The scheme was used mainly for its simplicity and ease of
encoding. The values of $h_{0H}$, $c_0$, and $h_{0V}$ were
restricted to lie between ±90 degrees of arc/sec (the "negative"
values corresponding to counterclockwise rotations). This
feasible region was defined to restrict the search space to
meaningful values and to prevent possible divergence of the
iterative process. The minimalization procedure converged to a
correct solution from any initial guess within this feasible
region. Not all eight distances were used to define V. To
minimize the influence of (quantization) errors, the lengths
$\ell_i$ were ordered in magnitude and the two extremal
magnitudes were discarded. We also tried to use the range of
$\ell$ (defined as $|\ell_{max} - \ell_{min}|$) as the error
function with good results. In both cases, the FOE was located
precisely (using single precision arithmetic [7 significant
digits]). When the precision with which the vectors v" were
defined was lowered to 4 significant digits (the angular error
made by this quantization depends on the magnitude of the vector
v" [see Figure 7]), the FOE was located within approximately ±5
degrees of arc of the correct position. Extensive testing with
real data (and using a more efficient, and faster minimalization
schema) should be performed to determine how the errors in v"
propagate through the computations and affect the precision with
which the FOE can be obtained.
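
The two variants of the error function mentioned above can be sketched as follows. The reading that "the two extremal magnitudes" means the smallest and the largest length is an assumption of this sketch, as are the function names.

```python
def trimmed_variance(lengths):
    """Variance of the lengths after discarding the smallest and the largest."""
    kept = sorted(lengths)[1:-1]
    mean = sum(kept) / len(kept)
    return sum((l - mean) ** 2 for l in kept) / len(kept)


def length_range(lengths):
    """Alternative error measure: spread between the extreme lengths."""
    return abs(max(lengths) - min(lengths))
```

With eight neighboring points, trimmed_variance uses six of the eight lengths, so a single badly quantized vector cannot dominate the error measure.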

DISCUSSION  AND  CONCLUSIONS 

It is important to realize precisely what has
been achieved, and how. Given a set of "retinal"
vectors v" on the planar projection surface, we have
shown that it is possible to extract the translational
velocity field, containing all information
about spatial disposition of the texture elements,
solely by computations using data available on the
projection surface (see Figure 8). In fact, besides
the velocity vectors v", only the positions of the
corresponding loci on the plane with respect to a
fixed reference point (the "fovea") are required.
Another feature of the method is that it can be
implemented as a local computation (the radius of
the neighborhood would have to be large enough),
and thus performed at many "retinal" locations in
parallel, decreasing the dependence of the
method on a good initial approximation to the
parameters hQH, cQ, and hQV. The simplicity of the
method is striking, especially in comparison with
other methods purporting to achieve the same results
(e.g., Prazdny 1980; see also Nagel 1981). The




method requires only a few points and the corresponding
"retinal" velocities as input (for example,
in the visual periphery, which is apparently
used by the human visual system to compute ego-motion
[Johansson, 1977]). One disadvantage of
the method is that it fails when the direction of
instantaneous motion is parallel to PP. In this
case, the FOE is undefined (it corresponds to an
ideal point of the projective plane). This is
not a serious drawback, however. Another similar
method based on maximizing the parallelism between
the vectors defining the translational field could
take care of this situation.

It is also important to realize that once the
FOE has been computed, we immediately know the
direction of the translatory motion on the projection
plane at each "retinal" locus; it is simply
the line connecting the FOE with the given locus
(on the "retinal" plane). To obtain information
about (relative) depth or (local) surface orientation,
we need to compute only the magnitude of
motion in this direction; the two-dimensional
problem is thus reduced to a more manageable
one-dimensional problem. This leads directly to a more
general schema where only the velocities (v") of
a few "interesting" image elements (at "prominent"
locations where the velocity v" can easily be
detected) are computed first to locate the FOE.
Following this, the magnitude of the translatory
motion at each image point would be computed without
explicit extraction of the optical flow (the
velocities v") itself. As was noted by Batali and
Ullman (1979) and by Horn and Schunck (1980), one can
compute, by a local computation, only the velocity
component in the direction of the gradient of the
image intensity function at a given "retinal" locus.
But this is all we need if the FOE has already been
located: by projecting this velocity component onto
the direction of the translatory field at a given
image plane locus, we would obtain the (relative)
depth information in its purest form, as the
distribution of the magnitudes of the translatory field.
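One way to read the last step is sketched below. It assumes that the flow at the locus is purely translational (any rotational part already removed), so that the measured gradient-direction component equals the translational magnitude times the cosine of the angle between the gradient and the line from the FOE to the locus; the sketch solves that relation for the magnitude. The function name and argument layout are hypothetical.

```python
import math


def translational_magnitude(foe, locus, gradient, normal_speed, eps=1e-9):
    """Magnitude of the translatory motion at `locus`, given the velocity
    component `normal_speed` measured along the intensity gradient.

    Assumes a purely translational flow directed along the line from the
    FOE to the locus. Returns None when the direction is undefined or the
    gradient is perpendicular to it (the component carries no information).
    """
    dx, dy = locus[0] - foe[0], locus[1] - foe[1]
    d_norm = math.hypot(dx, dy)
    g_norm = math.hypot(gradient[0], gradient[1])
    if d_norm < eps or g_norm < eps:
        return None
    cos_angle = (dx * gradient[0] + dy * gradient[1]) / (d_norm * g_norm)
    if abs(cos_angle) < eps:
        return None
    return normal_speed / cos_angle


# A translatory velocity of magnitude 2 along +x, observed through a
# gradient at 45 degrees, has a gradient-direction component of sqrt(2).
magnitude = translational_magnitude((0.0, 0.0), (1.0, 0.0), (1.0, 1.0), math.sqrt(2.0))
```

The distribution of these magnitudes over the image is the quantity that carries the (relative) depth information.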

To summarize, we have shown how the direction
of (relative) motion can be computed by a simple
minimization computation operating on data available
on the projection surface. The method can be implemented
locally and is also feasible biologically.
Speculatively, perhaps, its operation might be
reflected in the recent finding of the "looming" or
changing-size channels in the human visual pathway
(Regan, Beverly, and Cynader, 1979; Beverly and
Regan, 1979).

NOTES 

<Note  1> 

In general, only conclusions about relative quantities
can be derived by interpreting optical flows.
Local surface orientation, relative depth (the
ratio of distances of two texture elements in two
distinct visual directions), and relative motion
are such quantities.

<Note  2> 

The following notational convention will be used
throughout the paper: $\mathbf{n}$ denotes a vector,
$\hat{\mathbf{n}}$ is its unit vector, and $n$ is the
norm of $\mathbf{n}$, i.e., $\mathbf{n} = n\hat{\mathbf{n}}$.
Angular velocities are conceptualized as
axial vectors, i.e., vectors perpendicular to the
instantaneous plane of rotation, with magnitude
equivalent to the angular speed. The word "retinal"
will be used to denote the projection plane, PP.
P(x,y) denotes a "retinal" point P with "retinal"
coordinates (x,y) in the two-dimensional coordinate
frame centered at O'.

<Note  3> 

The set of paths traced by the image elements under
this motion is a family of hyperbolas with principal
axes inclined at angle w with respect to the y-axis.

The family is symmetrical about a straight line
through O'. This is the line corresponding to the
intersection of the plane of rotation with the
projection plane.

Figure 1. Without loss of generality, the projection
plane PP can be positioned at unit distance
from the center of projection O, and parallel to the
yz-plane. Any vector A can then be decomposed into
a component parallel to the projection plane, and a
component perpendicular to PP. O' is the center of
the "retinal" coordinate frame (2D).


Figure  2.  When  the  observer  rotates  about  the 
vector  y,  the  path  described  by  an  image  element  P 
on  the  image  plane  PP  is  a  hyperbola. 


Figure 3. The observer rotates about the line
parallel to y through O. The direction of the
velocity vector at P(x,y) is determined by tane.
The projections of the points R and S on PP have
the same horizontal velocity components.

Figure 4. The angular speed of a point x is

$$h_0 = \frac{h \sin(\pi/2 + \eta)}{\sqrt{x^2 + 1}}$$

by definition ($|OO'| = 1$). From this, h can be
computed directly as a function of the parameter $h_0$.


Figure 5. An image velocity v" (on the planar
projection surface PP) of a point P can be resolved
into three components. The hyperbolic component
h is due to the rotation of the ray about an axis
(through O) in the projection plane (the angular
velocity is a linear combination of x and y). The
circular component c is due to the rotation of the
ray about an axis (through O) parallel to z. The
translational component t is the remaining vector
which constrains the decomposition of v"; t is
constrained to be such that, for all points on PP,
the lines $L_i$ intersect in one common point. In the
illustration above, the direction angle of the
hyperbolic field is zero, i.e., the observer rotates
only about a line parallel to the y-axis.




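Figure 6. To find the intersection of L with
$L_i$, we have to solve for $\ell$ in

$$\mathbf{P} + \ell\,\mathbf{t} = \mathbf{P}_i + \ell_i\,\mathbf{t}_i .$$

To obtain $\ell$ we multiply both sides by $\mathbf{t}_i'$,
the perpendicular to $\mathbf{t}_i$. This yields

$$\ell = \frac{(\mathbf{P}_i - \mathbf{P}) \cdot \mathbf{t}_i'}{\mathbf{t} \cdot \mathbf{t}_i'}$$

where $\mathbf{P}$ and $\mathbf{P}_i$ are the (2D) position
vectors on the projection plane. If
$\mathbf{t}_i = (t_x, t_y)$ then $\mathbf{t}_i' = (-t_y, t_x)$.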
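The intersection described in the Figure 6 caption can be sketched as follows. This is a minimal illustration with hypothetical function names; points and directions are plain 2D tuples, and parallel lines (for which no finite intersection exists) are reported as None.

```python
def perpendicular(t):
    """Perpendicular of a 2D vector: (t_x, t_y) -> (-t_y, t_x)."""
    return (-t[1], t[0])


def dot(a, b):
    return a[0] * b[0] + a[1] * b[1]


def intersection_length(p, t, p_i, t_i, eps=1e-12):
    """Solve p + l*t = p_i + l_i*t_i for l by taking the dot product of
    both sides with the perpendicular of t_i, which eliminates l_i."""
    t_perp = perpendicular(t_i)
    denom = dot(t, t_perp)
    if abs(denom) < eps:          # lines are parallel
        return None
    diff = (p_i[0] - p[0], p_i[1] - p[1])
    return dot(diff, t_perp) / denom


def intersection_point(p, t, p_i, t_i):
    """Point where the line through p along t meets the line through
    p_i along t_i, or None if the lines are parallel."""
    length = intersection_length(p, t, p_i, t_i)
    if length is None:
        return None
    return (p[0] + length * t[0], p[1] + length * t[1])
```

For a horizontal line through the origin and a vertical line through (2, -1), the length along the first line is 2 and the intersection is (2, 0).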


Figure 7. The quantization error increases with
decreasing vector magnitude. At b, an error of
about 10 degrees of arc is made by representing
v as v', while at a, such a representation results
in an error of 45 degrees of arc.

Figure 8. ... ground plane) from a surface (plane)
slanted 10 degrees (toward observer) in the vertical ...
the magnitudes of the velocity vectors; the directions
of the vectors (and the FOE) depend on the parameters
of the relative motion.

relaxation matching applied to aerial images

INTRODUCTION

We  have  developed  a  symbolic  matching  system 
which  can  be  used  for  a  variety  of  matching  tasks 
in  scene  analysis.  The  system  is  designed  to 
handle  many  of  the  problems  encountered  in  the 
analysis  of  real  scenes:  noisy  feature  values, 
missing  elements,  extra  pieces  of  objects,  many 
features,  many  objects.  At  the  heart  of  this 
system  is  a  relaxation  based  matching  scheme.  A 
variety  of  relaxation  procedures  have  been  used 
with  varying  results. 

The basic matching procedure [1] and the
various relaxation methods [2,3] are discussed
elsewhere and will only be outlined here. This
paper will concentrate on a discussion of the
overall matching system and the performance of the
various relaxation techniques.

The input to the matching procedure is two
relational structures: one for the model of the
scene and the other for the input image. The
structures are represented as graph structures
with objects at the nodes (with associated feature
values) and relations between objects as the arcs
between the nodes.

SYMBOLIC  MATCHING  PROCEDURE 

The  goal  of  the  matching  procedure  is  to  find 
the  objects  in  the  image  (regions  and  lines)  which 
best  match  the  objects  given  in  the  model.  This 
is  essentially  finding  a  subgraph  in  the  image 
which  is  isomorphic  to  the  model,  except  that 
objects  may  be  missing  and  single  nodes  in  the 
model  may  correspond  to  several  in  the  image  when 
objects  are  broken  apart  by  the  segmentation 
procedures. 

The matching procedure is divided into two
iterations (see Fig. 1). The outer loop consists
of computing initial estimates of the assignments
for all elements in the model. Up to 30 potential
assignments for each element are maintained. Then
a relaxation procedure is applied which updates
the ratings of the assignments. When the rating
for one assignment of an element in the model
exceeds a threshold, the relaxation update loop is
terminated and all assignments above the threshold
are made permanent. Then the process continues
with the recomputation of the initial estimates,
but now relations (above, near, adjacent) with the
permanently assigned elements can be used in the
computation.

The repeated initialization is a crucial
component of the process. Since the initial
guesses are made only on the basis of feature
values (color, shape, texture, etc.), many incorrect
assignments are initially highly rated. The
relaxation steps eliminate many of these mistakes,
but cannot correct all of them because some
correct assignments are not among the early
candidates. This procedure also easily allows for
multiple segments in the image to be assigned to
one element in the model.

The final termination condition is the number
of iterations without any assignments reaching the
threshold. This number must be large enough so
that valid assignments can reach the threshold but
not so large that an incorrect assignment is
forced, by default, to a large value. This is
especially true with our primary relaxation method
[1], where something is always forced to a high
value (how rapidly this happens can be controlled,
and we use a relatively fast setting).
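The control structure described above (repeated initialization, an inner relaxation loop, and termination after a fixed number of iterations without a new assignment) can be summarized in a short sketch. It is only an illustration of the loop structure: `initial_estimates` and `relax_step` are hypothetical callables standing in for the feature-based rating and for whichever relaxation update is used, and the default threshold and iteration limit are the values quoted in the results section.

```python
def match(model, initial_estimates, relax_step, threshold=0.75, max_idle=15):
    """Return a dict mapping each model element to the list of image
    elements permanently assigned to it.

    initial_estimates(model, permanent) -> {(model_elem, image_elem): rating}
    relax_step(ratings, permanent)      -> updated ratings of the same form
    """
    permanent = {}
    while True:
        # Outer loop: recompute the initial ratings; relations with the
        # permanently assigned elements may now be used.
        ratings = initial_estimates(model, permanent)
        accepted = []
        # Inner loop: relax until some new assignment exceeds the threshold
        # or the iteration limit is reached.
        for _ in range(max_idle):
            ratings = relax_step(ratings, permanent)
            accepted = [
                (m, x) for (m, x), rating in ratings.items()
                if rating > threshold and x not in permanent.get(m, [])
            ]
            if accepted:
                break
        if not accepted:
            return permanent
        # Make every assignment above the threshold permanent; a model
        # element may collect several image segments this way.
        for m, x in accepted:
            permanent.setdefault(m, []).append(x)
```

Each pass through the outer loop adds at least one new permanent assignment or stops, so the procedure terminates once no further candidate can reach the threshold within the allowed number of iterations.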

Several different relaxation updating schemes
can be used in the inner loop. The simplest
technique is the "classical" method of Rosenfeld,
Hummel, and Zucker [2]. Our primary technique [1]
is similar, except that the updating is always in
a direction which improves a global criterion. A
third method is that of Kitchen [3], which provides
a different means of combining match ratings from
different properties. See the Appendix for a
summary of the relaxation updating functions for
these methods.

RESULTS 

The various methods were tested on 3
different aerial views: (1) a scene with 14 storage
tanks of 5 different sizes, (2) a high altitude
view of the San Francisco area with 14 objects
identified in the model (95 in the image), and (3)
a view of the Stockton, CA area with 20 model
objects (24 or more valid matches possible, and
more than 200 image elements). The results on
these images allow us to make some general
comments on the performance of the various
methods.

The basic technique of Rosenfeld et al. [2]
quickly makes several assignments in each of the
test scenes, but then reaches a stable state with
low probabilities for the most likely assignment.

The method of Kitchen [3] allows for a
variety of combining methods (by using different
functions for the fuzzy set operations).
Additionally, restricting the computation to use
only the one most likely assignment of neighboring
(related) elements reduces the time substantially
and improves the performance.

The results of running the various relaxation
methods are presented in Table 1. The sets of
assignments shown in Figs. 2, 3, and 4 are the ones
generated by the criteria optimization method
described fully in [1] (FP in Table 1). K-1 is
the Kitchen method [3] which uses MIN, MAX, and
the mean for the fuzzy AND, the fuzzy OR, and the
outer AND (Form 4 in his paper); the use of MIN for
the outer operation (Form 1 in his paper) produced
no assignments in our tasks.
K-2 uses product instead of MIN (Form 5). K-3 is
the same as K-1 except that all possible assignments
of neighbors are considered rather than the one most
likely assignment. Kitchen's Form 3 produced the
same results on our tasks as K-2. RHZ is the
classical Rosenfeld et al. method [2] and is
included as a historical reference point.

The threshold for forcing a permanent
assignment is 0.75 (for the Kitchen algorithms the
values were normalized only for this test) and 15
iterations were run before terminating the
procedure due to lack of assignments. In tests
with our algorithm, changing the 0.75 threshold
(between about 0.7 and 0.8) by small amounts has
little or no effect. Increasing it requires more
iterations and thus more time, and some assignments
may be lost. The results include only those
assignments which exceeded the threshold (0.75)
and do not include the most likely assignments
at the time of termination (15 iterations).
Including all these would increase the number of
correct ones, but with no clear separation in
likelihood values between correct and incorrect
ones.


Several comments can be made from these
results. First, Kitchen's method was not designed
to fit into our matching system and is not
oriented toward quickly producing unambiguous
results for a few of the elements in the graph.
But this feature is necessary when dealing with
problem domains such as these (i.e., many feature
values, many elements, similar objects, and noisy
data).

Second, considering more alternatives for
neighbors does not improve performance, but
decreases it with a substantial increase in time.
(This was not totally obvious without experiments.)
Since the likelihoods of the second, third, and
other alternatives are much less than the most
likely one, they should contribute little, if any,
to the computation. Tests with the other methods
(K-2, FP) give similar results (decreased
performance, increased time).

Third, our method performs better than the
others in these tasks. One major reason is the
optimization procedure used to guide the updating;
thus some assignments are discovered early and
then contribute substantially in the search for
further matches. Also, in our system, all
relations and features contribute to the rating
rather than only the best or worst as with Kitchen
using MIN and MAX.

Fourth, the 17 incorrect assignments by K-2
for Scene #1 start after 13 correct matches are
located. This is accounted for by the way we
handle multiple assignments for model elements and
by the fact that multiple matches for image
elements are only discouraged, not forbidden. The
change from MIN to product (between K-1 and K-2)
also contributes to the problem.

The updating approach adopted by Faugeras [1]
clearly performs better and operates faster. Each
iteration of this program takes longer, but fewer
are required.

APPENDIX

We present only a summary of the equations
for the relaxation updating; the details are
contained in the appropriate papers. Therefore
many of the terms will not be fully explained.
The classical Rosenfeld et al. method [2] is:

$$p_i^{(n+1)}(k) = \frac{p_i^{(n)}(k)\,Q_i^{(n)}(k)}{\sum_{l \in N} p_i^{(n)}(l)\,Q_i^{(n)}(l)} \qquad (1)$$

where $p_i^{(n)}(k)$ is the likelihood of assignment for unit i
to name k, Q is a measure of the compatibility of
the assignment with assignments of neighboring
units, and N is the set of all possible names (image
elements).
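A normalized multiplicative update of this kind can be sketched in a few lines. The sketch assumes that each likelihood is multiplied by its compatibility value Q and that the results are renormalized over the set of names N; how Q itself is computed from the neighboring units is not shown, and the function name is hypothetical.

```python
def normalized_update(p, q):
    """One relaxation update for a single unit.

    p: current likelihoods over the names in N (non-negative, summing to 1)
    q: compatibility value Q for each name (non-negative)
    Returns the renormalized products p[k] * q[k].
    """
    weighted = [pk * qk for pk, qk in zip(p, q)]
    total = sum(weighted)
    if total == 0:
        # No name is supported at all; leave the likelihoods unchanged.
        return list(p)
    return [w / total for w in weighted]


# Two equally likely names; the first is three times as compatible.
updated = normalized_update([0.5, 0.5], [3.0, 1.0])
```

Names that are more compatible with the assignments of neighboring units gain likelihood at the expense of the others, while the likelihoods for each unit continue to sum to one.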


The Kitchen updating method [3] is (rewritten
for our problem)

$$p_i^{(n+1)}(k) = \ldots \qquad (2)$$

where the two operators are the fuzzy logic AND and OR
operations. C is a local consistency measure
which is also used to compute the Q's in the
equations above and below. The generality of the
original formulation is not retained: only
features of single objects and relations between
two objects are given, not the arbitrary number
of the original.


The third method of Faugeras [1] is:

$$\mathbf{p}_i^{(n+1)} = \mathbf{p}_i^{(n)} + \rho_n\,P_i\{\mathbf{g}_i^{(n)}\} \qquad (3)$$

where $\rho_n$ is a positive number (step size) to
control the speed, $P_i$ is a projection operator to
maintain the constraint that $\mathbf{p}_i$ is a probability
vector, and $\mathbf{g}_i$ is a gradient function computed
from the compatibility measures and current
probability values for an object and its
neighbors.
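The role of the projection operator can be illustrated with a simplified stand-in. The sketch below is not the operator of [1]: it merely removes the mean of the gradient (so that the components still sum to one after the step) and shortens the step when a component would otherwise become negative. The function name is hypothetical.

```python
def projected_gradient_step(p, g, rho):
    """One step p + rho * P(g) that keeps p a probability vector.

    p:   current probability vector (non-negative, summing to 1)
    g:   gradient of the criterion with respect to p
    rho: positive step size
    """
    mean_g = sum(g) / len(g)
    direction = [gk - mean_g for gk in g]    # components sum to zero
    step = rho
    for pk, dk in zip(p, direction):
        if dk < 0:
            # Largest step that keeps this component non-negative.
            step = min(step, pk / -dk)
    return [pk + step * dk for pk, dk in zip(p, direction)]


small = projected_gradient_step([0.5, 0.3, 0.2], [1.0, 0.0, -1.0], 0.1)
large = projected_gradient_step([0.5, 0.3, 0.2], [1.0, 0.0, -1.0], 1.0)
```

With the small step the vector moves to approximately (0.6, 0.3, 0.1); with the large step the move is cut short when the third component reaches zero, which shows how the step size controls how rapidly an assignment is driven toward a high value.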


                                 Number
Method  Scene              Correct  Incorrect  Not Assigned  Time

K-1     1 (Tank Farm)         12        0            2       5
K-2     1                     14       17            0       30
K-3     1                      0        0           14       -
RHZ     1                     14       28            0       29
FP      1                     14        0            0       4

K-1     2 (San Francisco)      7        0            7       19
K-2     2                      9        0            5       28
K-3     2                      7        0            7       4:50
RHZ     2                      8       11            6       53
FP      2                     14        0            0       9

K-1     3 (Stockton)           9        0           11       1:25
K-2     3                     16        4            5       3:00+
K-3     3                      6        0           14       6:00+
RHZ     3                     16       11            6       2:30+
FP      3                     24        1            0       1:00

Table 1. Relaxation results.


Fig. 1. Use of the relaxation procedure
in the overall matching system.

Abstract

An intermediate level vision system that utilises
grey scale levels (typically 8 bits or 256 levels in our case)
has been implemented which locates and links intensity
discontinuities in a digitized image to subpixel precision.
The discontinuities are located and localised by utilizing
the zero crossings in the laterally inhibited image of the
digitized picture.

Introduction 

As has been pointed out in earlier work
in the literature (Nevatia & Babu 1978), the effectiveness
of many machine vision systems is often limited by the
low level processing that constitutes the first stage of the
system. Typically, this stage consists of operations such
as edge detection, thinning, thresholding, and linking; in
other words, line finding.

Given this current state of the art and the inspiration
of earlier MIT work (Binford-Horn 1972, Binford
1970, Herskovitz & Binford 1970), we have undertaken the
task of seeking an improvement to this stage of vision
systems. The processing reported here, a simplification of the
Binford-Horn approach, differs markedly from most
systems reported in the literature but has similarities to some
current MIT systems (Marr & Hildreth 1979, Grimson
1980). Like the MIT systems of Binford-Horn and
Marr-Hildreth, our system first applies lateral inhibition to the
image, then follows this step by detecting significant signal
from gradient magnitude (contrast), and finally localizes
the step by zero-crossing of the lateral inhibition signal.

Operators such as that of Nevatia and Babu are
essentially gradient operators. The gradient has a broad
maximum at an edge. Such operators require a thinning
process or a process of maximum selection, with degraded
resolution. Thresholding, too, has degraded resolution
from that demonstrated here (Binford 81). Intensities are
weighted averages over a pixel area. That introduces a
spatial smearing from the discrete sampling. The gradient
has a maximum which cannot be located closer than the
nearest integer pixel (0.29 pixel rms error), and with the
effects of thinning, can have worse accuracy. The lateral
inhibition signal, equivalent to a second derivative, has a zero
at the step, which can be interpolated through zero with
an accuracy which depends on the signal to noise ratio.

The significance of greater precision is that it
makes full use of the inherent information of the signal.
In cases in which thresholding or gradient operators might

Lateral  Inhibition 

Like Binford-Horn but unlike the MIT approach,
we convolve the image with a square mask of side 2n+1 to
yield the local average intensity (over the (2n+1) x (2n+1)
area) and subtract this average from the intensity of the
pixel at the centre of the square. The resulting intensity
is the laterally inhibited value of the image at the central
pixel. An example of this is presented in Figure 2 using
as input the image shown in Figure 1. It is this value that
is used for location of the zero crossings. In the process
of calculating the laterally inhibited image, linear intensity
functions (including the special case of the constant
function) are mapped into the zero value. An intensity
discontinuity is characterised by values that rise/fall to a
maximum over n pixels, a switch to a maximum of the
opposite sign, and a fall/rise to zero again over n pixels.
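The operation can be illustrated with a minimal sketch. The function name is hypothetical, the image is a plain list of rows, and leaving the border pixels (where the mask does not fit) at zero is an assumption made for brevity.

```python
def lateral_inhibition(image, n):
    """Subtract from each pixel the mean intensity of the (2n+1) x (2n+1)
    square centred on it. Pixels closer than n to the border are left at 0.
    """
    rows, cols = len(image), len(image[0])
    out = [[0.0] * cols for _ in range(rows)]
    area = (2 * n + 1) ** 2
    for r in range(n, rows - n):
        for c in range(n, cols - n):
            total = sum(
                image[rr][cc]
                for rr in range(r - n, r + n + 1)
                for cc in range(c - n, c + n + 1)
            )
            out[r][c] = image[r][c] - total / area
    return out


# A linear intensity ramp is mapped to zero; a vertical step edge gives
# values of opposite sign on either side of the discontinuity.
ramp = [[2 * c + 3 * r for c in range(6)] for r in range(6)]
step = [[0, 0, 0, 10, 10, 10] for _ in range(6)]
ramp_out = lateral_inhibition(ramp, 1)
step_out = lateral_inhibition(step, 1)
```

On the step image the output changes sign between the last dark column and the first bright column, which is the zero crossing used to locate the discontinuity.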

Although it is possible to convolve the image with
masks that utilise more than a central pixel, say a 2x2
or a 3x3 central window, from which the local average
(the 2n + k average) is subtracted, we have limited our
investigation to the use of a single central pixel. It yields
a good result for the cases studied to date.

As is to be expected, when two discontinuities are
separated by less than the mask dimension there will be
interference which will result in locational errors. While
a small mask will minimise this effect, it will be more
sensitive to noise and the errors induced by it.

A Lockheed L1011 parked at the San Francisco Airport
Figure 1


Discontinuity  Point  Detection 

It is the zero crossing that occurs during the switch
from one maximum to the other that is the zero crossing of
interest since this zero corresponds to the location of the
discontinuity.

For the case of subtracting the value of the single
central pixel from the local average, this switch from peak
to peak occurs over a 2 pixel interval. It is the location of
the zero of this switch that is the zero crossing of interest.

The exact position of the crossing was taken to
be the linearly interpolated position between the pixels
obtained using the values of intensity on either side of the
crossing.
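The interpolation amounts to finding where the straight line through the two laterally inhibited values crosses zero. A minimal sketch, assuming unit pixel spacing and a hypothetical function name:

```python
def zero_crossing_position(x_left, v_left, v_right):
    """Sub-pixel position of the zero between pixel x_left and pixel
    x_left + 1, given the laterally inhibited values at the two pixels.

    Returns None unless the two values have strictly opposite signs.
    """
    if v_left * v_right >= 0:
        return None
    # The line through (x_left, v_left) and (x_left + 1, v_right)
    # reaches zero after a fraction v_left / (v_left - v_right) of a pixel.
    return x_left + v_left / (v_left - v_right)


symmetric = zero_crossing_position(3, 2.0, -2.0)
skewed = zero_crossing_position(3, 1.0, -3.0)
```

Equal and opposite values place the crossing midway between the pixels (3.5); when the left value is the smaller in magnitude the crossing moves toward the left pixel (3.25).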

Zero  point  crossings  were  calculated  in  both  the 
horizontal  and  vertical  directions  for  the  processed  image. 


Discontinuity  Point  Linking 

The zero points or zero crossings of the laterally
inhibited image were linked together over a mesh of
dimension equal to a single pixel separation. Decisions on the
linking were based upon the intensity values of the corners
of the mesh, which were four pixel values. Use of measured
picture noise as a threshold to reject spurious crossings
was accompanied by the simultaneous rejection of intensity
discontinuities that marked the edges of shadows. In
an effort to overcome this problem and improve rejection
of satellite crossings resulting from the lateral inhibition
process, a structure filter was added. This filter examines
the twelve intensity points around the four of the central
mesh and rejects zero crossings that are of extent less than
four pixels. The filter did not pass points unless they
belonged to trajectories that exited from the 3x3 pixel
area containing the central mesh.

A zero crossing was deemed to occur along the edge
of a mesh if one corner of the mesh edge was above zero
and the other corner was below zero. These 2 corner values
were then used to calculate location along the edge of
the crossing. Intramesh points were established using the
average of the four corner points as the value at the center
of the mesh. The position of the zero crossing was then
calculated to lie between the centre point and one of the
corner points.

Using the fact that a zero crossing point is one of
a continuous trajectory of such points, the decision as to
which points must be joined can easily be determined. If
it is assumed that the calculated value for the center of
the mesh is accurate at least in so far as to whether or not
it is above or below the zero value, this information can
be used to separate two trajectories that pass through a
single pixel. An example of this is to be found in Figure
5.

The result of applying lateral inhibition to Figure 1
Figure 2


Results

Figure 1 (512 x 512 x 8 bits) depicts an aeroplane
on a runway. We have chosen this picture to illustrate our
method since results on this image have been presented
elsewhere in the literature and it contains many objects
of interest for this type of processing.

The result of operating on this image with a 5 x
5 lateral inhibition operator yields the image presented in
Figure 2.

When the zero-crossings in Figure 2 having a contrast
of more than 2 intensity units are found and linked
we obtain an outline figure for the laterally inhibited
image. Such a figure produced in this way is presented
in Figure 3.

By expansion of windows of Figure 3, we can
illustrate the precision of our method. These expansions
are presented in Figures 4 & 5.


Linked  zero-crossing  points  of  laterally  inhibited  image 
Figure  3 


It is interesting to compare the accuracy of the
results obtained by the method presented here with known
values. Data from Lockheed documents indicates that the
wing span for models -1, -100, and -200 is 47.35 m and that
the wing span for model -500 is 50.08 m. For all models the
tail span is 21.82 m. A convenient measure of the accuracy
then is the ratio of the tail span to the wing span, which for
the data above is 0.4619 and 0.4367 respectively. Using
expanded images for the wing and tail tips, such as shown
in Figure 5, a ratio of 0.4636 was obtained, which agrees
with the first ratio to within 0.4% and with the second ratio
to within 6.2%. We may therefore conclude that the image
was one of the former models. This result would suggest
that different planes could, except for, perhaps, some very
special cases, be easily distinguished and identified using
the technique reported here.
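The comparison above is simple arithmetic and can be checked directly. The sketch uses only the ratios quoted in the text; the dictionary keys are informal labels for the two groups of models.

```python
def percent_difference(measured, reference):
    """Absolute difference between the two values, in percent of the
    reference value."""
    return abs(measured - reference) / reference * 100.0


measured_ratio = 0.4636
reported_ratios = {"-1/-100/-200": 0.4619, "-500": 0.4367}

differences = {
    model: percent_difference(measured_ratio, ratio)
    for model, ratio in reported_ratios.items()
}
# The model group whose ratio lies closest to the measured one.
best_model = min(differences, key=differences.get)
```

The measured ratio differs from the first reference ratio by about 0.4% and from the second by about 6.2%, so the nearest match is the group of models -1, -100, and -200.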


Discussion  and  Conclusions 

As the results show, our processing produces rather
improved results over those that have been obtained using
other methods. The structure filter was most effective in
enhancing the fine structure to be found in the middle
right hand side of the image. Improvements in this
approach are being pursued.

For a comparison of the results shown here with
other methods using the same image as input see Brooks
(1979) for results using the Nevatia-Babu technique and
Arnold (1978) for the edges obtained using a Hueckel
operator.

For comparison with recent results using other
images and techniques currently available in the literature,
see the results of Rosenfeld (1979) and Tavakoli (1980).


Expansion of a window of Figure 3 (128 x 128)
Figure 4


Expansion of a window of Figure 3 (16 x 16)
Figure 5


We are currently extending this work to provide
improved 'real picture' data for some of the other image
processing projects being studied in our laboratory such
as Arnold's stereo work, ACRONYM (Brooks 1979), and
Lowe's work on geometric modeling (these Proceedings).

ABSTRACT

When  an  image  is  smoothed  using  small  blocks  or 
neighborhoods,  the  results  may  be  somewhat  unreliable 
due to the effects of noise on small samples. When
larger blocks are used, the samples become more
reliable, but they are more likely to be mixed, since
a large block will often not be contained in a single
region  of  the  image.  A  compromise  approach  is  to  use 
several  block  sizes,  representing  versions  of  the 
image  at  several  resolutions,  and  to  carry  out  the 
smoothing  by  means  of  a  cooperative  process  based 
on  links  between  blocks  of  adjacent  sizes.  These 
links  define  "block  trees"  which  segment  the  image 
into  regions,  not  necessarily  connected,  over  which 
smoothing  takes  place.  In  this  paper,  a  number  of 
variations  on  the  basic  block  linking  approach  are 
investigated,  and  some  tentative  conclusions  are 
drawn  regarding  preferred  methods  of  initializing 
the  process  and  of  defining  the  links,  yielding 
improvements  over  the  originally  proposed  method. 


INTRODUCTION 

Suppose  that  an  image  is  composed  of  a  few  types 
of  regions  each  having  approximately  constant  gray 
level.  In  principle,  the  image  can  be  segmented  into 
these regions by gray level thresholding, i.e., by
slicing the grayscale into intervals, and classifying
each pixel according to the interval in which
its gray level lies. However, if the image is noisy,
this pixel-by-pixel segmentation process may make
many  errors,  since  the  noise  will  cause  some  of  the 
pixels  belonging  to  one  type  of  region  to  have  gray 
levels  lying  in  the  intervals  corresponding  to  another 
type.  Segmentation  could  become  more  reliable  if  we 
first  smoothed  the  image  to  reduce  its  noisiness. 
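
The pixel-by-pixel thresholding described above can be sketched in a few lines of Python. This is an illustration only: the function name `classify_by_threshold`, the cut point of 100, and the tiny test image are hypothetical and do not come from the experiments reported here.

```python
from bisect import bisect_right

def classify_by_threshold(image, thresholds):
    """Label each pixel with the index of the gray-level interval it falls in.

    `thresholds` is a sorted list of cut points that slices the grayscale
    into len(thresholds) + 1 intervals.
    """
    return [[bisect_right(thresholds, gray) for gray in row] for row in image]

# Hypothetical image: a dark region (about 10) next to a bright one (about 200).
# The 120 is a noisy pixel that belongs to the dark region.
image = [[10, 12, 200],
         [11, 120, 210]]
labels = classify_by_threshold(image, [100])
```

The noisy pixel falls in the wrong interval and so receives the label of the bright region, which is the kind of error that smoothing before segmentation is meant to reduce.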

An  image  can  be  smoothed  by  local  averaging,  i.e., 
averaging  the  gray  level  of  each  pixel  with  the  gray 
levels of a set of its neighbors. However, this process
will blur the boundaries between the regions,
since a pixel near such a boundary has neighbors
lying in both its own region and the adjacent region.
If we knew which neighbors belonged to the same
region as the pixel, we could use only these neighbors
in the average. In other words, the quality
of the smoothing process would be improved if we
could first segment the image into the appropriate
regions, so that smoothing could be performed within
the  regions  only,  not  across  their  borders. 


These remarks suggest that it might be preferable
to perform smoothing and segmentation concurrently,
using some type of cooperative process.
An example is the combined smoothing and neighbor
linking process defined in [1]. Here weights are
assigned to the links between a pixel and its
neighbors based on their similarity; the image is
smoothed by weighted averaging of each pixel with
its highest-weighted neighbors; concurrently, the
weights are adjusted as the similarities between
neighbors change. The process is iterated, with
weighted averaging and weight adjustment alternating.
Note that this process does not involve classification
of the pixels, but does yield a segmentation
of the image into regions based on the connectedness
relation defined by the links, if we threshold
their weights.

This paper deals with another approach to concurrent
smoothing and segmentation based on linking,
using versions of the image at different resolutions
and defining links between overlapping "pixels" at
successive resolutions. In a low-resolution image,
the pixels interior to regions have gray levels
that are less noisy, since a pixel at low resolution
represents an average and is thus less variable.
On the other hand, the lower the resolution,
the less likely it is that a pixel is contained
in a single region; most pixels will overlap two
or more regions. The approach considered here,
which was first described in [2], takes advantage
of both high and low resolutions by using a cooperative
process in which the images of successive
resolutions interact. A detailed description of
this approach will be given in Section 2. A number
of variations on the multilevel approach have been
investigated, involving changes in the initialization
of the process, the method of defining links,
and the iteration sequencing; these are described
in Section 3.

MULTIRESOLUTION  PIXEL  LINKING 

Let the size of the original image be $2^n$ by $2^n$.
To define the reduced-resolution versions of the
image, we make use of an exponentially tapering
"pyramid" of arrays of sizes $2^{n-1}$ by $2^{n-1}$,
$2^{n-2}$ by $2^{n-2}$, ..., 4 by 4, 2 by 2, so that the
kth level has size $2^{n-k}$ by $2^{n-k}$. To avoid border
effects, all these arrays are regarded as cyclically closed,
i.e., the first column is regarded as lying to the
right of the last column, and the top row below the
bottom row. The elements of each array will be
called pixels or nodes. Many different schemes
can be defined for constructing such pyramids [3],
but in our experiments we used only the simple
scheme that will now be described.

We will assign gray levels to the nodes at each
level (k > 0) by taking (weighted) averages of the gray
levels of 4-by-4 blocks of nodes at the level below
it. The blocks corresponding to adjacent nodes
overlap by 50%; this is why the reduction in size
from level to level is by a factor of 2, not a
factor of 4. For example, suppose node (i,j) at
level k ≥ 1 corresponds to the block of nodes

    (u,v)     (u+1,v)     (u+2,v)     (u+3,v)
    (u,v-1)   (u+1,v-1)   (u+2,v-1)   (u+3,v-1)
    (u,v-2)   (u+1,v-2)   (u+2,v-2)   (u+3,v-2)
    (u,v-3)   (u+1,v-3)   (u+2,v-3)   (u+3,v-3)

at level k-1 (where (u,v) = (2i-1, 2j+1)). Then node
(i+1,j) corresponds to the block

    (u+2,v)     (u+3,v)     (u+4,v)     (u+5,v)
    (u+2,v-1)   (u+3,v-1)   (u+4,v-1)   (u+5,v-1)
    (u+2,v-2)   (u+3,v-2)   (u+4,v-2)   (u+5,v-2)
    (u+2,v-3)   (u+3,v-3)   (u+4,v-3)   (u+5,v-3)

where all additions and subtractions are modulo $2^{n-k+1}$,
the size of level k-1.
It is easily seen that any node (u,v) below the top
level (i.e., k < n-1) belongs to four blocks corresponding
to nodes on the level above it - in our
example, the nodes (i,j), (i-1,j), (i,j+1), and
(i-1,j+1). [Note that only for the last of these
nodes does (u,v) belong to the center 2-by-2 portion
of its block; for the other three, (u,v) is a border
point of their blocks.] The level k-1 nodes in the
block corresponding to a given node at level k will
be called its sons, and the level k nodes to whose
blocks a given node at level k-1 belongs will be
called its fathers. Thus every node at level > 0
has 16 sons, and every node at level < n-1 has four
fathers. Note that since there are only 16 nodes
at level n-2, each of them is a son of all four
nodes at level n-1, so that every node in the pyramid
is a descendant of every one of these "top" nodes.
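
As an illustration only, the father/son geometry just described can be checked with a short Python sketch. The function names `sons` and `fathers` and the choice of an 8-by-8 level under a 4-by-4 level are assumptions made for the example; the index convention (u,v) = (2i-1, 2j+1) with cyclic arithmetic is the one given in the text.

```python
def sons(i, j, size_below):
    """4-by-4 block of level k-1 nodes under node (i, j) at level k.

    Uses (u, v) = (2i-1, 2j+1), columns u..u+3 and rows v..v-3,
    all taken modulo the size of level k-1 (cyclic closure).
    """
    u, v = 2 * i - 1, 2 * j + 1
    return [((u + a) % size_below, (v - b) % size_below)
            for b in range(4) for a in range(4)]

def fathers(u, v, size_below):
    """Level k nodes whose 4-by-4 blocks contain node (u, v) of level k-1."""
    size_above = size_below // 2
    return [(i, j)
            for i in range(size_above) for j in range(size_above)
            if (u, v) in sons(i, j, size_below)]

size_below = 8   # hypothetical: an 8-by-8 level lying under a 4-by-4 level
father_counts = {len(fathers(u, v, size_below))
                 for u in range(size_below) for v in range(size_below)}
son_counts = {len(set(sons(i, j, size_below)))
              for i in range(4) for j in range(4)}
# Directly under the 2-by-2 top level, every node is a son of all four top nodes.
top_counts = {len(fathers(u, v, 4)) for u in range(4) for v in range(4)}
```

Under these assumptions every node has exactly four fathers and every father has sixteen distinct sons, in agreement with the counts stated above.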

The node linking process is as follows: the
reduced resolution images are initially defined by
unweighted averaging of the gray levels in each
block. The gray level of each node is then compared
with the levels of its four fathers, and a link is
established between the node and its most similar
father, i.e., the father whose level is closest to
the node's level. After this has been done at every
level, we recompute the gray level of each father by
averaging only those sons that are linked to it.
(If no sons are linked to a father, we give it "gray
level" zero.) Based on these new averages, a node's
most similar father may have changed, so we next
change the links as necessary, then recompute the
averages, then change the links again, and so on.
Typically, this process stabilizes after a few iterations.


To see how this process works, let us define
the base of a node as the set of pixels that are
linked (through as many intermediate stages as
necessary) to that node. Thus initially the base
of every node is a square block of pixels. If the
base of a node initially lies mostly inside a
region, the node is most likely to become linked
to nodes on the level below that also lie (mostly)
in that region; thus its recomputed average will
become closer to the region average. As the process
is iterated, nodes at relatively high levels
acquire values that approach the average values of
regions, even though they are too large to fit into
a region. Slight initial biases in the node averages
at high levels will result in high-level nodes being
driven toward values that correspond closely with
the averages of regions or sets of similar regions
in the image. For further discussion of the process,
see [2].

Suppose that there are not more than four types
of regions in the image. For each type, there should
be at least one node at the top level of the pyramid
whose average converges to the average gray level
of the regions of that type. This node will be
linked to nodes which are linked to nodes ... which
are linked to the pixels belonging to these regions.
In other words, this node becomes the root of a
tree whose leaves are the pixels that lie in regions
of the given type. If there are fewer than four types
of regions, there may be two such trees corresponding
to the same region type, representing different subsets
of the pixels in these regions. If we know how
many region types there are supposed to be, we can
suppress some of the nodes at the top level (i.e.,
forbid anyone to link to them), keeping only as many
top-level nodes as there are types. (Alternatively,
we can "merge" some of the top-level nodes together,
averaging together their values and using this average
as the value for each of them.) In this way, we can
insure that the number of trees (having distinct
values) is the same as the desired number of region
types.

In summary, the iterative linking and averaging
process is defined as follows:

a) Initialize the node values by simple block
averaging of each node's 16 sons

b) Link each node to that one of its four
fathers whose value is closest to its own

c) Recompute the node values by averaging the
values of only those sons that are linked
to the node

d) Change the links in accordance with these
new values

e) Repeat steps (c-d) as many times as desired.
Typically, there is little change after the
first few iterations, and there is no change
at all after 10 or 15 iterations.

At any stage of this process, the links define a
set of (up to four) trees rooted at the top level
of the pyramid, and we associate with each pixel
the value at the root of its tree. Thus the process
smooths the image to an extreme degree, giving each
pixel its tree average as a smoothed gray level.
At the same time, it segments the image into (up
to four) subsets, where each subset consists of
the pixels which are the leaves of one of the trees.
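
Steps (a)-(e) can be sketched as follows. This is a minimal pure-Python illustration under stated assumptions, not the program used in the experiments: the names `block`, `mean` and `pyramid_link`, the fixed (sorted) order used to resolve ties, the bottom-up order of recomputation, and the 8-by-8 test image are all hypothetical choices.

```python
def block(i, j, size_below):
    """Sons of node (i, j): a cyclic 4-by-4 block on the level below."""
    u, v = 2 * i - 1, 2 * j + 1
    return [((u + a) % size_below, (v - b) % size_below)
            for b in range(4) for a in range(4)]

def mean(values):
    return sum(values) / len(values)

def pyramid_link(image, iterations=10):
    """Return the smoothed image: each pixel gets the value at its tree root."""
    n = len(image)                       # image is n-by-n, n a power of 2
    sizes = [n]
    while sizes[-1] > 2:
        sizes.append(sizes[-1] // 2)
    # Step (a): initialize every level by averaging each node's 16 sons.
    levels = [{(x, y): float(image[y][x]) for x in range(n) for y in range(n)}]
    for k in range(1, len(sizes)):
        below = levels[k - 1]
        levels.append({(i, j): mean([below[s] for s in block(i, j, sizes[k - 1])])
                       for i in range(sizes[k]) for j in range(sizes[k])})
    links = []
    for _ in range(max(1, iterations)):
        # Steps (b) and (d): link each node to the father closest in value.
        links = []
        for k in range(1, len(sizes)):
            best = {}
            for father in sorted(levels[k]):      # fixed order resolves ties
                for son in block(father[0], father[1], sizes[k - 1]):
                    gap = abs(levels[k][father] - levels[k - 1][son])
                    if son not in best or gap < best[son][0]:
                        best[son] = (gap, father)
            links.append({son: father for son, (gap, father) in best.items()})
        # Step (c): recompute each father from the sons linked to it
        # (a father with no linked sons gets "gray level" zero).
        for k in range(1, len(sizes)):
            for father in levels[k]:
                linked = [levels[k - 1][son]
                          for son, f in links[k - 1].items() if f == father]
                levels[k][father] = mean(linked) if linked else 0.0
    # Follow the links upward from every pixel to the root of its tree.
    smoothed = [[0.0] * n for _ in range(n)]
    for (x, y) in levels[0]:
        node = (x, y)
        for k in range(1, len(sizes)):
            node = links[k - 1][node]
        smoothed[y][x] = levels[-1][node]
    return smoothed

# Hypothetical 8-by-8 test image: dark left half, bright right half.
image = [[10] * 4 + [100] * 4 for _ in range(8)]
result = pyramid_link(image)
distinct_values = {value for row in result for value in row}
```

Because every pixel ends up attached to one of the four top-level nodes, the output can contain at most four distinct gray levels, whatever the input.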

The smoothing and segmentation accomplished by
this process can be compared with those achieved by
the pixel linking process of [1]. In [1] the
links are all at the pixel level, and the smoothing
is local. Even if the link strengths all converged
to values of 1 (within a region) and 0 (between
regions), many iterations would be required to
obtain the global average of each region at each
pixel of the region, since it takes O(region diameter)
iterations for information to propagate
across the region. In the process described here,
on the other hand, the links are between levels,
and information can propagate "across" a region in
O(log region diameter) iterations, since nodes comparable
in size to the region lie only (log region
diameter) levels above the pixel level. Moreover,
in our process, smoothing can take place even over
sets of non-connected regions of the same type,
whereas the process of [1] can smooth only within
a connected region.

The concept of linking each node with its most
similar father may be compared with the smoothing
processes described in [4-6], where a set of neighborhoods
lying on various sides of a pixel are
examined, and the pixel's value is replaced by the
average of the least variable of these neighborhoods
(since this neighborhood presumably lies
almost entirely within the pixel's region). Using
the most similar neighborhood (i.e., the one whose
average is closest to the pixel's gray level),
rather than the least variable neighborhood, would
probably work well too; but we could not use the
least variable neighborhood in our scheme. In
any event, the methods of [4-6] use neighborhoods
of only a single size, which limits the speed with
which the smoothing can propagate, as discussed in
the previous paragraph.

VARIATIONS 

In the experiments described in this section,
several variations on the basic pyramid linking
process were tried. These variations were concerned
with how to initialize the node values; how
to choose the father to which a node is linked,
and in particular, what to do in case of ties;
and how the iteration process is sequenced. In
the following paragraphs we describe the variations,
and then show the results obtained by using
combinations of these variations on a standard
set of images (which were also used in [2]): an
infrared image of a tank, a portion of a blood
smear, and a portion of a chromosome spread. These
images are shown in Figure 1 (a-c). All results
are shown for a stage at which the iteration process
has stabilized; this is usually after about
10 iterations.

a) Initialization. In the method used in [2],
the value of each node was initialized by
averaging the values of all 16 of its sons.
An alternative which (as we shall see)
seems to give better results is to initialize
by averaging the values for only
four of the sons, namely those whose positions
in the image are closest to that of
the node. (The position of a node is
understood to be at the center of its
block.) Note that in this alternative
scheme, the initial averages are all
nonoverlapping.

b) Father selection. In the method of [2]
each node is linked to the father closest
in value to the node. A more general idea
is to take into account both closeness in
value and closeness in position. We can
compute link merits based on a formula
such as $\Delta(D+s)$, choosing the father for
which $\Delta(D+s)$ is smallest, where $\Delta$ is the
difference in value, D is the Euclidean
distance between positions, and s is a
parameter which is used to vary the effect
of the D contribution (for large s, differences
in D have little effect).

b') Ties. If two fathers have the same link
merits, we resolve the tie based on any
arbitrary ordering of the fathers, e.g.,
NW, NE, SE, SW. The choice of this ordering
should not significantly affect the
results.

c) Sequencing. In [2], links are determined
for all levels; then averages are recomputed
for all levels; and this process is
repeated. An alternative is to iterate
level by level: as soon as the links from
the nodes at level k are redefined, the
averages at level k+1 are recomputed, and
the links from level k+1 are then redefined
based on these new averages.

d) Top level nodes. The number (≤ 4) of nodes
used at the top level should be the same as
the desired number of region types - 2 for
the tank and chromosomes, 3 for the blood
cells. We can insure that only two or three
nodes at the top level are used by initializing
the values of the remaining node(s)
to a very high number, thus insuring that
no nodes will ever link to them. As a
refinement, we can fix the top-level nodes
that we do use to have values that represent
estimates of the expected region
averages; we will show some examples using
this variation. We will also show examples
of results obtained when we use all four
nodes at the top level, even though the
desired number of region types is less
than four. As we shall see, the process
then tends to create somewhat artificial
discriminations within the regions.
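
The father selection rule of (b) and the tie-breaking rule of (b') can be illustrated with a small Python sketch. The function name `choose_father`, the dictionary layout, and the example values and positions are hypothetical; only the merit formula (value difference times distance plus s) and the NW, NE, SE, SW ordering are taken from the text.

```python
import math

FATHER_ORDER = ["NW", "NE", "SE", "SW"]   # arbitrary tie-breaking order, as in (b')

def choose_father(node_value, node_pos, fathers, s):
    """Pick the father with the smallest merit delta * (D + s).

    `fathers` maps a label from FATHER_ORDER to (value, (x, y) position);
    delta is the difference in value and D the Euclidean distance.
    """
    def merit(label):
        value, (fx, fy) = fathers[label]
        delta = abs(value - node_value)
        distance = math.hypot(fx - node_pos[0], fy - node_pos[1])
        return delta * (distance + s)
    # min() returns the first minimal item, so the list order resolves ties.
    return min((label for label in FATHER_ORDER if label in fathers), key=merit)

# Hypothetical node of value 50 at (1, 1): NW is nearer, SE is closer in value.
fathers = {"NW": (48, (0, 0)), "NE": (80, (4, 0)),
           "SE": (49, (4, 4)), "SW": (80, (0, 4))}
near_choice = choose_father(50, (1, 1), fathers, s=0)
value_choice = choose_father(50, (1, 1), fathers, s=5)
```

With s = 0 the nearer father wins, while with s = 5 the distance term matters less and the father closer in value is chosen, which is the behavior the parameter s is meant to control.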

We first show the results obtained when we
use the desired number of nodes at the top level,
but do not attempt to set the values of those
nodes to the expected region averages. Figure 2
(top two rows) shows these results for the four
combinations of initialization and sequencing
schemes. We see that in the chromosome case
(Figure 2c), four-son initialization gives better
results; when 16-son initialization is used, some
of the small chromosomes are lost, probably because
too much of the background is initially
averaged with them, so that they link to a top-level
node whose value converges to the background
value rather than to the chromosome value.
The initialization scheme has little effect on
the results for the other two images, and the
iteration sequencing scheme has little effect on
any of the images. The order used for tie-breaking
also has little effect, as we see from the bottom
left pictures in Figure 2 (which use the same
initialization and sequencing schemes as the top
left pictures). Finally, the bottom right pictures
in Figure 2 show what happens when we give some
weight to Euclidean distance (s=5) in choosing
the links (otherwise, same as top left); note
that this too improves the results in the chromosome
case, and has little effect in the other
two cases. It seems from these results that
four-son initialization is preferable to 16-son
initialization, and that it may also be preferable
to give some weight to Euclidean distance in
choosing links; but the other variations make little
difference. The exact shapes of the tank
and cell nucleus are somewhat sensitive to variations
because the correct links for blocks near the
borders of these regions will be somewhat ambiguous,
due to the noisiness or texturedness of
the regions.

Figure 3 shows analogous results when the
top-level nodes are given estimates of the average
region gray levels as fixed values. Again,
the variations make little difference for the
tank and cell images, but they are significant for
the chromosome image. The loss of the small chromosome
has now become dependent on the iteration
sequence and even on the tie-breaking order (!);
and when we use the four-son initialization method,
a large chromosome is lost (!). Apparently,
attempting to fix the values of the top-level nodes
as equal to the estimated region averages can
actually degrade the performance of the pyramid
linking process.

Figure 4 gives analogous results when all four
top-level nodes are used, so that the process tries
to find four region types in each image.* The
resulting artifacts are especially apparent for
the chromosome image, where the background gets
segmented into three subregions that differ appreciably
in average gray level. Here again, when
we use 16-son initialization, the small chromosomes
become part of the background, but this does not
happen when four-son initialization is used, nor
when weight is given to Euclidean distance in
choosing links. The other variations have little
effect, and all of the effects are minor for the
cell and tank images (the tank region splits up
into "noisy" subregions in various ways, but
does not get badly confused with the background).
Thus these results support the conclusions derived
from Figure 2.

*If the input image contains fewer than four gray
levels (e.g., if we threshold the chromosome or
tank image into two levels or the cell image into
three), the process does not create additional
values, even though all four top-level nodes are
used.

When  the  process  is  applied  to  a  perfectly 
regular  input  pattern  such  as  a  checkerboard,  it 
breaks  down  and  fails  to  segment  the  pattern  into 
two  region  types,  unless  ties  are  broken  randomly. 
Figure  5  shows  results  analogous  to  those  in 
Figure  4  (left  column  corresponds  to  Fig.  4  top 
left,  and  right  column  to  bottom  right),  but 
using  random  tie-breaking:  the  results  are  quite 
similar. 

The smoothing effect of the process as we
follow the links from level to level can be
assessed by constructing histograms corresponding
to each level's view of the image. Suppose that,
for a given k, we give each pixel a gray level
equal to the value of the node at level k to which
it is linked. When we do this for k = 0, 1, 2, ...,
we obtain a sequence of successively smoother and
simpler images, whose histograms become successively
more spiky, until finally, the histogram obtained
from the top level consists of (at most) four
spikes. Such histograms for the three images,
after one iteration of the linking and reaveraging
process, are shown in Figure 6 for levels 0, 1,
2, 3, 4. (16-son initialization was used, and
links were chosen based on value similarity only.)
If we did not want to rely on the iterative process
to converge to a good segmentation, we could
still consider using a single iteration of the
process to improve the separation of the histogram
peaks, so that segmentation by thresholding based
on the histogram would be easier.
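
The construction of each level's histogram can be sketched as follows. The data layout (`links[m]` mapping each level-m node to its linked father, `values[k]` mapping each level-k node to its value), the function name `level_histogram`, and the four-pixel toy example are assumptions made for illustration; they are not the data of Figure 6.

```python
from collections import Counter

def level_histogram(pixels, links, values, k):
    """Histogram of the image as seen from level k.

    Each pixel is given the value of the level-k node to which it is linked
    (through k link steps), and the resulting gray levels are counted.
    """
    histogram = Counter()
    for node in pixels:
        for m in range(k):
            node = links[m][node]       # climb one level along the link
        histogram[values[k][node]] += 1
    return histogram

# Hypothetical toy pyramid: four pixels linked to two nodes on level 1.
values = [{"p0": 10, "p1": 12, "p2": 50, "p3": 54},
          {"a": 11, "b": 52}]
links = [{"p0": "a", "p1": "a", "p2": "b", "p3": "b"}]
hist0 = level_histogram(values[0], links, values, 0)
hist1 = level_histogram(values[0], links, values, 1)
```

In this toy case the level 0 histogram has four single-count bins, while the level 1 histogram has only two bins of count two, showing on a small scale how the histograms become more spiky at higher levels.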

CONCLUDING  REMARKS 

This paper has investigated a number of variations
on the basic pyramid linking process. The
results suggest the following tentative conclusions:

a) It appears to be preferable to use schemes
in which some weight is given to the relative
positions of nodes, both in initializing
their values and in choosing links,
especially in cases involving regions that
consist of many small connected components.
Apparently, when we take relative positions
into account, we have a better chance
of preserving the integrity of small regions.

b) It is desirable to specify the desired number
of region types (i.e., pixel classes),
by allowing only that number of nodes at
the top level to be "active." Otherwise,
the process tends to split some of the
classes artificially. On the other hand,
using estimates of the average gray levels
of the classes to fix the values of the
nodes at the top level may degrade the
results, perhaps because it introduces
premature biases that are not compatible
with the early stages of the linking pattern.

c) The process is relatively insensitive to the
sequencing of the iterations and to the
node ordering used for breaking ties.

The  experiments  reported  in  this  paper  have  led 
to  a  better  understanding  of  the  pyramid  linking 
concept. The conclusions will serve as guidelines
in the design of linking processes based on
pixel  properties  other  than  (average)  gray  level, 
for  application  to  the  smoothing  and  segmentation 
of multispectral or textured images.

Figure 1. The three images used in the experiments:
a) tank, b) blood cells, c) chromosomes.

Figure 2. Effects of varying the initialization,
sequencing, tie-breaking rule, and linking
criterion (see text).

Figure 3. Analogous to Figure 2, but initializing
the top-level nodes with estimates of
the region averages.

Figure 4. Analogous to Figure 2, but using all four
nodes at the top level.

Figure 5. Analogous to the top left and bottom
right pictures in Figure 2, but breaking
ties randomly.

Figure 6. Histograms obtained, after one iteration
of linking and re-averaging, when the node
values at a given level are assigned to
the pixels having those nodes as ancestors,
for levels 0, 1, 2, 3, 4.


Abstract

General constraints on the interpretation of image boundaries
are described and implemented. We illustrate the use
of these constraints to carry out geometric interpretation of
images up to the volumetric level. A general coincidence assumption
is used to derive suggestive but incomplete interpretations
for local features. A reasoning system is described
which can use these suggestive hypotheses to derive consistent
global interpretations, while maintaining the ability to
remove the implications of hypotheses which are disproved
in the face of further evidence. An important aspect of interpretation
is the classification of image boundaries (intensity
discontinuities) into those caused by geometric, reflectance,
or illumination discontinuities. These interact with other
hypotheses regarding occlusion by solid objects, the direction
of illumination, aspects of object geometry, and the
production of illumination discontinuities by geometric discontinuities.
Although only a subset of the constraints and
system design features have been implemented to date, we
demonstrate the successful interpretation of some simulated
image boundaries up to the volumetric level, including the
construction of a three-space model.


Introduction 

This paper describes work in progress on the derivation
and use of general constraints on the interpretation
of image curves (also known as intensity discontinuities or
lines). These constraints are derived from general assumptions
regarding illumination, object geometry, and the imaging
process, and are carried out to the level of three-dimensional
volume interpretations. This work is being
done in the context of the Acronym vision system [2],
and is expected to interface closely with the higher levels
of Acronym, which make interpretations from generic object
models. The current implementation of these constraints
in an independent computer program is described
and demonstrated.

Much of the previous work on the interpretation of
image lines has concentrated on the constraints imposed on
boundary junctions by certain classes of geometric objects
[3, 5, 6, 7]. We believe that there are more general constraints
on the formation of image boundaries, and that the
use of these constraints is of great importance for the interpretation
of real data and general classes of images. Most
previous attempts at boundary interpretation have used connectivity
as the main source of information, whereas we have
placed at least as much emphasis on shape, size and location.
We also make some use of the intensity contrast across
boundaries. Our implementation of these constraints is designed
so that no single constraint is taken to be absolute,
and the system can therefore tolerate some errors and incompleteness
in its initial data.

It is important to realize that there is seldom a unique
interpretation for any local image property. All of the constraints
which we use are in the form of coincidence assumptions,
in which some property suggests a particular interpretation
unless some set of coincidences (or errors in the
data) have occurred. This means that our reasoning cannot
be strictly deductive and monotonic (as was, for example,
the constraint system used by Waltz [7]), since we must make
and use hypotheses while retaining the option of having further
information prove them false. Therefore, an important
aspect of this work on image interpretation has been the development
of a reasoning system which can operate cleanly
and efficiently with these incomplete constraints. On the
other hand, the search space for this problem is not large,
since most of the constraints have a small branching factor
and there is typically much redundant information available
in support of an interpretation.

We categorize image intensity discontinuities into three
distinct classes: those caused by discontinuities in the
geometry of an object (edges), in the reflectance of an object
(markings), and in the illumination (shadows). The constraints
on each of these categories are quite different, and
much of the discussion in this paper will deal with the recognition
and separate implications of these different classes of
image curves.

Reasoning  on  the  basis  of  coincidence  assumptions 

As mentioned above, the use of incomplete constraints
forces us to use a non-monotonic reasoning system, in which
further evidence can disprove a previously held hypothesis.
Therefore, the method of deduction we have chosen has
been to form an explicit symbolic representation of each
hypothesis, and to maintain a record of its support and
implications so that it can be reevaluated and undone if new
contextual evidence indicates that the original hypothesis
was false. Each hypothesis has pointers to all the hypotheses
and types of evidence which have given support to it, as well
as pointers to the hypotheses which it supports.

Figure  1  shows  the  data  structure  representation  for  a 
typical  hypothesis  (in  this  example  the  hypothesis  is  that 
a  particular  curve  represents  a  geometric  edge  causing  oc¬ 
clusion  on  a  specific  side).  The  hypothesis  in  this  example 
is  supported  by  evidence  from  several  other  hypotheses 
(represented  in  the  Evidence  slot)  and  makes  use  of  three 
basic  constraints  (to  be  described  in  detail  in  the  follow¬ 
ing  section):  that  the  geometric  discontinuity  corresponds  to 
an  intensity  discontinuity  in  the  image,  that  the  occlusion 
is suggested by the termination of another geometric edge
at this geometric edge, and that this edge casts a certain
shadow. The Suggests slot of the data structure points to
all hypotheses which are based at least in part on this one.

When conflicting interpretations are suggested for some
part of the image, the evidence for each interpretation is
evaluated to see if one interpretation is strong enough to
exclude the other. If there is insufficient information to
make this decision, then both hypotheses are pursued until
there is. When a previously held hypothesis is thought to
be false, its implications are undone and any further results
propagated. This non-committal type of reasoning appears
to be flexible and easy to use. The major difficulty is that
this requires that special code be written to evaluate and
resolve each type of conflict that could occur, but we are
looking at other possible resolution schemes. The explicit
maintenance of constraints is much better than the use of
backtracking for testing hypotheses, since it is not limited
to the last-in-first-out order of testing typically enforced by
a  backtracking  stack,  and  any  results  apply  to  the  entire 
problem  space. 
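
The record keeping described above can be sketched as follows. This is only an illustrative sketch in Python, not the original implementation: the class name Hypothesis, the field from_image, and the rule that a hypothesis is undone once none of its supporters remain active are all assumptions made for the example.

```python
from dataclasses import dataclass, field


@dataclass(eq=False)
class Hypothesis:
    """One explicit hypothesis with two-way support links (illustrative)."""
    statement: str
    constraints: list = field(default_factory=list)            # constraints used
    evidence: list = field(default_factory=list, repr=False)   # supporters
    suggests: list = field(default_factory=list, repr=False)   # dependents
    from_image: bool = False   # assumed flag: supported directly by image data
    active: bool = True

    def add_support(self, supporter):
        """Link `supporter` as evidence for this hypothesis, in both directions."""
        self.evidence.append(supporter)
        supporter.suggests.append(self)


def retract(hyp):
    """Mark `hyp` false, then undo dependents left without any support."""
    if not hyp.active:
        return
    hyp.active = False
    for dependent in hyp.suggests:
        supported = dependent.from_image or any(e.active for e in dependent.evidence)
        if not supported:
            retract(dependent)


edge = Hypothesis("curve 12 is an occluding edge",
                  ["intensity discontinuity"], from_image=True)
shadow = Hypothesis("curve 30 is a shadow cast by curve 12", ["shadow casting"])
planar = Hypothesis("the region under curve 30 is planar")
shadow.add_support(edge)
planar.add_support(shadow)

retract(edge)   # new contextual evidence contradicts the edge hypothesis
```

Because every hypothesis carries its own links, a retraction can start anywhere in the network and is not tied to the order in which the hypotheses were formed, which is the advantage over a backtracking stack noted above.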

The  coincidence  assumptions  could  sometimes  be  used 
to  measure  the  likelihood  that  a  given  coincidence  will  oc¬ 
cur,  and  therefore  be  used  to  compute  a  probabilistic  degree 
of  confidence  in  a  hypothesis.  However,  these  probabilistic 
measures can be extremely context-dependent and may vary
widely from image to image in ways that cannot be known
before the image is interpreted (i.e., the probabilities cannot
reasonably be assumed to be independent). The combination
of probability values may be of importance in some
cases for speeding up the analysis, by telling us what to
examine first, but we have made no attempt to use them
in our current system. When some consistent overall interpretation
has been found for an image, there is usually
much redundant information available in support of the interpretation
and there is no need to choose between alternatives
on the basis of explicit probabilities. Our set of
hypotheses resembles the "blackboard" form of hypothesis
representation [4] in many ways; however, we attempt to
resolve conflicts by pursuing each possibility independently
rather than by combining probability estimates, and we
maintain the explicit history of each hypothesis in symbolic
form.

Constraints on the interpretation of image curves

This section describes a series of general constraints on
the three-space interpretation of image lines. In each case
we attempt to describe the specific coincidences which would
lead to alternate interpretations. Many of these constraints
are developed in more detail in [1].

In  a  certain  trivial  sense,  any  of  the  image  features 
described  in  the  following  constraints  could  be  the  result 
entirely  of  reflectance  discontinuities  rather  than  geometric 
or illumination discontinuities, since the image could be a
picture  of  a  picture.  However,  interpretations  which  are 
consistent  with  solid  geometry  and  illumination  are  un¬ 
likely  to  arise  from  surface  markings,  except  when  those 
markings  have  been  specifically  designed  to  correspond  to 
images  of  geometric  objects.  Therefore,  we  treat  surface 
markings  which  have  consistent  geometric  interpretations  as 
coincidences  which  are  hypothesized  only  when  given  other 
evidence. 

The "curves" described in the following constraints are
assumed to separate regions of different intensities. Regions
which are too narrow to have a measurable width ("wires" or
thin lines marked on a surface), although often represented
as ordinary lines in a line drawing, are locally different from
other types of curves in a digitized image and are localized
by different criteria (see [1]). They should be treated as
thin  occluding  regions,  even  though  the  precise  width  of  the 
regions  may  not  be  accurately  known.  If  we  wish  to  inter¬ 
pret  line  drawings  in  which  there  is  no  distinction  locally 
between  thin  regions  and  boundaries  separating  regions  of 
different  intensities,  then  the  following  constraints  can  be 
easily  modified  to  incorporate  this  possibility. 

1)  Tangent  discontinuities  in  curves.  Breaks  in  an  image 
curve (also known as tangent discontinuities or corners) are
important for the same reason that intensity discontinuities
are important in an image: they are features which can be
sharply localized, and can therefore lead to more restrictive
constraints than more diffuse features. Two-dimensional
breaks in an image curve are closely related to breaks in
three-dimensional space curves: in fact, a smooth space
curve cannot have a break in its image. Likewise, a break in
a space curve will result in a break in its image, unless there
is a coincidence in which the observer is coplanar with the
two tangents. The situation is the same for a cast shadow:
there will be a break in the shadow cast by a geometric edge
which contains a break, unless the source of illumination is
coincidentally coplanar with the two tangents of the break.

2)  Straight  lines.  An  image  line  which  is  straight  must 
be  the  image  of  a  straight  space  curve,  unless  the  curve  is 
planar  and  the  observer  is  coincidentally  aligned  with  the 
plane  of  curvature.  Likewise,  a  straight  shadow  curve  in  the 
image  must  have  been  cast  by  a  straight  geometric  boundary 
onto  a  planar  surface,  unless  the  source  of  illumination  is 
coincidentally in the plane of curvature of the geometric
boundary  or  the  observer  is  in  the  plane  of  curvature  of  the 
shadow  curve. 

3) Termination at a continuous curve. When an image curve
terminates at a continuous curve (a T junction), the terminating
curve cannot be closer to the observer than the
continuous curve; otherwise, it would be a coincidence that
the termination happened to occur on the other line. The T
junction could be the result of three different occurrences:
occlusion of the terminating edge by a geometric boundary,
the termination of a surface marking at the visible
edge of a geometric object, or a specific set of surface markings.
Therefore, if we know that the terminating curve is a
geometric boundary, then we can infer that the continuous
curve is also a geometric boundary and we know its direction
of occlusion. If we know that the continuous curve is a
geometric boundary occluding on the side of the terminating
curve, then we can infer that the terminating curve must be
a surface marking. If we know that the terminating curve is
a shadow, then we can infer that the continuous curve is a
non-concave geometric boundary (if it were concave we would
see a continuation of the shadow from the same vertex).
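
The three inferences at a T junction can be written as a small rule table. The sketch below is illustrative only; the function name, the label strings, and the dictionary returned are assumptions for the example and do not describe the original program.

```python
def infer_at_t_junction(terminating=None, continuous=None,
                        occludes_on_terminating_side=False):
    """Apply the T-junction inferences of constraint (3).

    `terminating` and `continuous` are the known labels of the two curves:
    "geometric", "shadow", "marking", or None when unknown.
    Returns a dict holding only the newly inferred facts.
    """
    inferred = {}
    if terminating == "geometric":
        # A geometric edge that stops at a continuous curve is being occluded.
        inferred["continuous"] = "geometric"
        inferred["occlusion"] = "continuous curve occludes the terminating curve"
    elif terminating == "shadow":
        # A concave edge would show the shadow continuing from the same vertex.
        inferred["continuous"] = "non-concave geometric"
    if continuous == "geometric" and occludes_on_terminating_side:
        # The terminating curve lies on the occluding surface itself.
        inferred["terminating"] = "marking"
    return inferred


from_edge = infer_at_t_junction(terminating="geometric")
from_shadow = infer_at_t_junction(terminating="shadow")
from_boundary = infer_at_t_junction(continuous="geometric",
                                    occludes_on_terminating_side=True)
```

Each returned fact would be entered as a new hypothesis rather than as a certainty, since any of these inferences can be defeated by a coincidence.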

4)  Crossing  of  continuous  curves.  When  two  continuous 
curves cross one another (an X junction), this is an indicator
of either an illumination discontinuity, transparency, or an
unusual combination of surface markings, since the curve
closest to the observer does not occlude either side of the
other (note that we are assuming other local evidence for
wires, as mentioned above). If one of the curves is an
illumination discontinuity (shadow boundary) this implies
that the other curve lies on the same surface and is not a
geometric boundary, since otherwise a break would occur
(as will be discussed further below).

5)  Contrast  across  shadow  edges.  The  contrast  across  a 
shadow  edge  is  equal  to  the  ratio  of  direct  to  indirect  light, 
which remains fairly constant across an image for a distant
light source. However, the situation is complicated
by  the  presence  of  reflections  from  nearby  objects,  which 
can  considerably  increase  the  amount  of  indirect  illumina¬ 
tion.  A  stronger  constraint  is  that  the  contrast  ratio  along 
the  length  of  a  shadow  will  change  only  smoothly  and  in¬ 
dependently  from  the  surface  on  which  it  falls,  so  that  as 
the  shadow  curve  crosses  different  reflectance  boundaries 
(or  even  crosses  illumination  boundaries  cast  by  secondary 
sources  of  illumination)  there  will  be  the  same  contrast  ratio 
on  both  sides  of  the  boundary.  This  is  a  valuable  constraint 
for  identifying  shadow  lines.  In  the  case  of  the  X  junction 
mentioned  above  in  (4),  the  illumination  discontinuity  will 
have  the  same  contrast  ratio  on  both  sides  of  the  junction, 
and  can  therefore  be  distinguished  from  the  other  curve  at 
the X junction, barring coincidence. In color imagery, the
ratios  of  illumination  at  each  wavelength  across  a  shadow 
boundary  should  be  fairly  constant  and  vary  only  smoothly 
along  the  length  of  the  shadow,  and  could  therefore  provide 
an even stronger constraint (e.g., the dark side of a shadow
boundary  will  be  bluer  than  the  sunlit  side  on  a  clear  day). 
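
The contrast-ratio test can be illustrated with a short sketch. It assumes that intensity has been sampled on the lit and shaded sides at successive positions along a candidate curve; the function name, the sample values, and the 15 percent tolerance are hypothetical choices for the example.

```python
def smooth_contrast_ratio(samples, tolerance=0.15):
    """Return True if the lit/shaded ratio changes only slightly along a curve.

    `samples` is a list of (lit_intensity, shaded_intensity) pairs taken at
    successive positions along the curve; shaded intensities must be positive.
    """
    ratios = [lit / shaded for lit, shaded in samples]
    return all(abs(b - a) <= tolerance * a for a, b in zip(ratios, ratios[1:]))


# A shadow crossing from a bright surface onto a dark one keeps its ratio.
shadow_curve = [(200, 50), (196, 49), (80, 20), (84, 21)]
# A reflectance marking shows no such regularity.
marking_curve = [(200, 50), (80, 60)]

shadow_ok = smooth_contrast_ratio(shadow_curve)
marking_ok = smooth_contrast_ratio(marking_curve)
```

The absolute intensities in the first example change by more than a factor of two where the curve crosses a reflectance boundary, yet the ratio stays at 4, which is what identifies the curve as an illumination boundary.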

6)  Shadow  breaks  caused  by  geometric  breaks.  As  men¬ 
tioned  in  (1),  barring  coincidence,  breaks  (tangent  discon¬ 
tinuities)  in  geometric  edges  are  observable  as  breaks  in 
image  lines,  and  geometric  edges  with  breaks  cast  shadows 
with breaks. Therefore, given a known direction of illumination,
the geometric break causing any particular shadow
break is constrained to lie in a precise direction. This constraint
is so strong that the mere existence of a break in the
given  direction  is  support  for  the  hypothesis  of  a  shadow.  If 
the  direction  of  illumination  is  unknown,  then  almost  paral¬ 
lel  sets  of  matching  breaks  between  hypothesized  shadows 
and  other  image  lines  provide  evidence  for  the  hypothesis 
of  the  direction  of  illumination  (it  can  usually  be  assumed 
that the source of illumination is moderately distant from
at  least  some  parts  of  an  image). 
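
For a known illumination direction, the matching test reduces to checking that the displacement from a geometric break to a shadow break is parallel to that direction. The sketch below assumes a distant source, so that illumination rays are approximately parallel in the image; the function name, the argument light_dir (the image direction in which the light travels), and the 3 degree tolerance are hypothetical.

```python
import math


def matches_illumination(geometric_break, shadow_break, light_dir,
                         max_angle_deg=3.0):
    """Return True if shadow_break lies down-light from geometric_break."""
    dx = shadow_break[0] - geometric_break[0]
    dy = shadow_break[1] - geometric_break[1]
    length = math.hypot(dx, dy)
    norm = math.hypot(light_dir[0], light_dir[1])
    if length == 0 or norm == 0:
        return False
    cos_angle = (dx * light_dir[0] + dy * light_dir[1]) / (length * norm)
    cos_angle = max(-1.0, min(1.0, cos_angle))   # guard against rounding
    return math.degrees(math.acos(cos_angle)) <= max_angle_deg


light = (2.0, 1.0)
good = matches_illumination((10, 10), (20, 15), light)      # parallel
off_axis = matches_illumination((10, 10), (20, 30), light)  # about 37 degrees off
up_light = matches_illumination((10, 10), (0, 5), light)    # wrong side
```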

In  perspective  imagery  there  is  an  illumination  conver¬ 
gence  point  in  the  image  through  which  the  images  of  all 
illumination rays from a point source pass (this is true even
for  nearby  point  sources).  The  location  of  this  point  can  be 
determined  from  any  set  of  two  or  more  matches  between 
shadow  and  geometric  breaks.  If  the  point  source  is  in  front 
of  the  camera  lens  plane,  then  the  convergence  point  is  of 
course  the  location  of  the  image  of  the  point  source.  If 
the light source is behind the camera lens plane, then the
illumination convergence point is located at the point of
projection of the light source onto the film plane through
the projective center of the camera, and the illumination
streams towards this point rather than away from it. If
the point source is exactly in the lens plane of the camera,
then the perspective effect compensates for divergence from
the light source to make the illumination convergence point
infinitely far off, so all images of illumination rays are exactly
parallel (these insights are due to Sid Liebes).
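
Locating the convergence point from two matches amounts to intersecting two image lines. A minimal sketch using homogeneous line coordinates follows; the function names are illustrative, and a real system would presumably fit the point to more than two matches to reduce the effect of measurement error.

```python
def line_through(p, q):
    """Homogeneous coefficients (a, b, c) of the line a*x + b*y + c = 0."""
    (x1, y1), (x2, y2) = p, q
    return (y1 - y2, x2 - x1, x1 * y2 - x2 * y1)


def convergence_point(match_a, match_b, eps=1e-9):
    """Intersect the image rays defined by two (geometric, shadow) matches.

    Returns (x, y), or None when the rays are parallel, which corresponds
    to a convergence point infinitely far off.
    """
    a1, b1, c1 = line_through(*match_a)
    a2, b2, c2 = line_through(*match_b)
    w = a1 * b2 - a2 * b1
    if abs(w) < eps:
        return None
    return ((b1 * c2 - b2 * c1) / w, (c1 * a2 - c2 * a1) / w)


# Two matches whose rays meet at the image point (5, 3).
point = convergence_point(((6, 3), (8, 3)), ((5, 4), (5, 7)))
# Two matches with parallel rays.
parallel = convergence_point(((0, 0), (1, 1)), ((0, 1), (1, 2)))
```

The sketch does not decide whether the illumination streams toward or away from the point; that depends on which side of the lens plane the source lies, as described above.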

It should be noted that breaks in shadow curves can also
be  caused  by  the  intersection  of  shadows  cast  by  different 
objects,  so  it  cannot  be  assumed  that  there  is  a  geometric 
break  corresponding  to  every  shadow  break.  The  constraints 
given  here  could  be  extended  to  cover  discontinuities  in  cur¬ 
vature  and  other  shape  properties  as  well  as  discontinuities 
in  tangents,  although  these  are  more  difficult  to  implement 
computationally. 

7) Casting of shadow curves by geometric edges. Given
one or more matches between shadow breaks and geometric
breaks it is easy to determine which geometric edges are
casting which shadow edges. An important constraint is
that each point on the shadow edge must correspond to
some point on the geometric edge in the direction toward
or away from the illumination vanishing point, although all
parts of the geometric edge may not be observable (say,
if  it  is  occluded  by  another  object).  Once  a  match  has 
been  hypothesized  between  a  shadow  edge  and  a  geometric 
edge,  many  important  inferences  follow:  the  casting  curve 
is  known  to  be  a  geometric  edge  or  limb  with  the  direction 
of  occlusion  depending  on  the  direction  of  contrast  across 
the  shadow  edge.  If  the  geometric  edge  is  straight  then  any 
curvature  in  the  shadow  curve  is  due  to  curvature  in  the 
surface  on  which  it  is  cast.  Likewise,  if  the  shadow  surface 
is  known  to  be  planar,  the  curvature  of  the  shadow  can  be 
used  to  calculate  the  curvature  of  the  casting  edge.  The 
image  separation  of  the  casting  curve  from  the  shadow  can 
be  used  to  calculate  their  relative  range  from  the  camera,  as 
will  be  described  later.  Shadows  essentially  provide  a  second 
projection  of  the  object  geometry,  in  addition  to  the  original 
image, and can therefore be used in much the same way as
stereo information.

8)  Junctions  of  two  or  more  discontinuous  curves.  When 
two  or  more  curves  terminate  at  the  same  junction,  it 
would be a coincidence if the observer were aligned so that
separated  vertices  in  space  landed  at  the  same  point  in  the 
image.  Therefore,  for  L,  Y,  K,  or  higher-order  junctions,  it 
is  reasonable  to  hypothesize  that  coincidence  in  the  image 
implies  coincidence  in  space.  In  other  words,  knowing  con¬ 
straints  on  the  3-space  location  of  the  endpoint  of  any  of  the 
terminating curves gives us the same constraints on the 3-space
locations of the other endpoints at the vertex. When
a  shadow  curve  terminates  at  a  Y  or  higher-order  junction, 
we  can  assume  that  one  of  the  other  terminating  curves  is 
a  geometric  edge  casting  this  shadow  onto  a  surface  passing 
through  the  junction  (otherwise  it  would  be  a  coincidence 
that  the  shadow  happened  to  pass  through  the  junction). 

9)  Propagation  of  direction  of  occlusion.  When  the  direction 
of occlusion is known for a geometric boundary (i.e., we
know which of the surfaces on either side the edge belongs
to), then any unambiguous continuations of the geometric
boundary will have the same direction of occlusion. As long
as the curve is continuous we assume continuity in space,
and by (7) we also assume continuation at L junctions.

10)  Surface  continuity.  We  assume  smoothness  and  con¬ 
tinuity  of  surfaces  when  there  is  no  intensity  discontinuity 
in  the  image.  If  an  intensity  discontinuity  on  a  surface  is  due 
to a geometric discontinuity of the surface, then we would
expect  to  see  a  discontinuity  in  the  geometric  boundary  of 
the  surface  where  it  intersected  this  intensity  discontinuity 
(unless  the  observer  is  in  the  plane  of  the  two  tangents  at 
the  boundary). 

Knowing  the  direction  of  occlusion  (from  3,7,9,  etc.) 
allows  us  to  form  surface  descriptions  by  looking  at  the 
regions  bounded  by  geometric  boundaries.  In  many  cases 
it  is  possible  to  form  surface  descriptions  by  merely  follow¬ 
ing  the  continuations  of  geometric  boundaries  around  the 
extent of the surface, until they are possibly occluded by
another  surface.  Another  technique  is  to  work  away  from  the 
occluding  side  of  a  geometric  boundary,  ignoring  illumina¬ 
tion  and  reflectance  discontinuities,  until  either  an  oppos¬ 
ing  geometric  boundary  is  found  or  an  occluding  geometric 
object is found. Note that this is different from the usual
region segmentation of images, since we are doing it in the
geometric domain. These can be difficult constraints to
implement  computationally,  since  they  deal  with  properties 
of  entire  surfaces  rather  than  just  a  curve.  Our  current  im¬ 
plementation  handles  only  some  of  the  more  common  cases 
of  surface  description. 

11)  Alignment  of  image  curves.  When  two  straight  lines  are 
aligned  in  an  image  (even  though  they  may  be  separated  by 
a  gap  of  a  considerable  distance),  they  must  also  be  aligned 
in  space,  or  else  the  lines  must  be  parallel  and  the  observer 
must  be  coincidentally  in  the  plane  of  the  two  lines.  This 
constraint allows us, for example, to hypothesize continuity
of  line  segments  on  both  sides  of  an  occluding  object.  This 
constraint can be extended to deal with the alignment of circular
arcs, elliptical curves, repetitive textures, symmetries,
and any shapes which are predicted from hypotheses and
knowledge  of  the  imaged  object.  This  constraint  can  also 
be  used  to  bridge  gaps  in  curves  due  to  errors  in  the  curve 
detection  process  or  insufficient  contrast  across  some  part  of 
a  boundary.  This  is  an  important  area  for  further  research. 
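
For straight segments, the alignment test can be sketched as a check on the angle between the two segments and on the perpendicular offset of one from the line through the other. The function name and both tolerances are hypothetical, and the gap between the segments is deliberately left unconstrained, as in the text.

```python
import math


def aligned(seg_a, seg_b, max_angle_deg=2.0, max_offset=1.5):
    """Return True if seg_b lies along the infinite line through seg_a."""
    (ax1, ay1), (ax2, ay2) = seg_a
    (bx1, by1), (bx2, by2) = seg_b
    ux, uy = ax2 - ax1, ay2 - ay1
    vx, vy = bx2 - bx1, by2 - by1
    len_u, len_v = math.hypot(ux, uy), math.hypot(vx, vy)
    if len_u == 0 or len_v == 0:
        return False
    # Angle between the two undirected lines, from the cross product.
    sin_angle = abs(ux * vy - uy * vx) / (len_u * len_v)
    if math.degrees(math.asin(min(1.0, sin_angle))) > max_angle_deg:
        return False

    def offset(px, py):
        # Perpendicular distance from (px, py) to the line through seg_a.
        return abs(ux * (py - ay1) - uy * (px - ax1)) / len_u

    return max(offset(bx1, by1), offset(bx2, by2)) <= max_offset


base = ((0, 0), (10, 0))
across_gap = aligned(base, ((30, 0.5), (50, 0.8)))   # nearly collinear
shifted = aligned(base, ((30, 5), (50, 5)))          # parallel but offset
turned = aligned(base, ((30, 0), (40, 10)))          # 45 degrees away
```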

12)  Prediction  of  illumination  boundaries.  If  we  have 
formed  hypotheses  of  the  geometry  of  an  object,  the  sur¬ 
rounding  surfaces,  and  the  direction  of  illumination,  then 
we  can  predict  the  locations  of  illumination  discontinuities 
in  the  image,  and  thereby  produce  new  hypotheses  for  image 
lines  at  these  locations,  as  well  as  confirm  or  contradict  our 
original  hypotheses.  Prediction  of  this  sort  can  also  be  used 
to  check  consistency  of  surface  occlusion. 

Since  we  are  attempting  to  derive  three-dimensional 
structure  from  image  clues,  the  constraints  above  all  deal 
with image features which are quasi-invariant with respect
to viewpoint (e.g., a curve break in three-space produces
an image curve with a break over a wide range of viewing
conditions, even though the angle of the break is not
invariant). The higher levels of Acronym produce predictions
of quasi-invariants from specific object models, and
these  should  interface  well  with  the  more  general  constraints 
described  above. 

The list given above is far from complete. We are
particularly interested in developing new constraints for inferring
the cross sections and volume descriptions for the
geometric objects in an image. Many other image features
(such as parallelism) are only slightly less invariant
with  respect  to  viewpoint,  and  should  also  be  included  in  a 
general  system. 

An  Implementation 

We have written a preliminary version of a computer
program which implements the hypothesis formation system
and many of the constraints described above. This program
has been tested on simulated image curve data derived
by hand from a real image, with the encouraging results
described below. We are in the process of implementing
more constraints, and hope to soon test the program on
data derived automatically from images by curve detection
programs being written here at Stanford. The program has
been implemented in Maclisp on a DEC KL-10, using the
record package and Acronym environment created by Rod
Brooks [2]. All of the constraints we currently use are computationally
inexpensive, and the hypothesis formation for
the example given on the following pages took less than 2
seconds of computer time.

The  initial  curve  data  is  translated  into  a  set  of  “curve 
hypotheses”  in  order  to  initialize  the  hypothesis  generation 
process. Each curve is represented as a series of points
with the tangent of the curve given at each point, and
cubic splines are assumed as the method of interpolating
for intermediate points. All curves are indexed in a grid
array under all grid squares through which they pass, and
it is therefore economical to search image neighborhoods
for curve features. During input, each curve termination
is linked to other curves which terminate or pass through
the same spot. When using real data it may be useful to
hypothesize  extensions  of  curves  which  do  not  terminate  at 
a  vertex. 
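
The grid index can be sketched as follows. This is an illustrative Python version, not the original code: the class name CurveGrid and the cell size are assumptions, and for brevity only the sample points of each curve are indexed, which covers every grid square the curve passes through only when the samples are dense relative to the cell size.

```python
from collections import defaultdict


class CurveGrid:
    """Index curves by the grid squares that their sample points fall in."""

    def __init__(self, cell_size=16):
        self.cell_size = cell_size
        self.cells = defaultdict(set)

    def _cell(self, x, y):
        return (int(x // self.cell_size), int(y // self.cell_size))

    def add_curve(self, curve_id, points):
        for x, y in points:
            self.cells[self._cell(x, y)].add(curve_id)

    def curves_near(self, x, y, radius=1):
        """Return ids of curves indexed within `radius` cells of (x, y)."""
        cx, cy = self._cell(x, y)
        found = set()
        for i in range(cx - radius, cx + radius + 1):
            for j in range(cy - radius, cy + radius + 1):
                found |= self.cells.get((i, j), set())
        return found


grid = CurveGrid(cell_size=10)
grid.add_curve("wing edge", [(1, 1), (12, 3)])
grid.add_curve("tail shadow", [(95, 95)])

same_cell = grid.curves_near(5, 5, radius=0)
neighbors = grid.curves_near(25, 5)
nothing = grid.curves_near(50, 50)
```

A neighborhood query touches only a few cells regardless of how many curves the image contains, which is what makes the search for nearby terminations and junctions economical.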

A weakness of the current system is its control structure.
Currently, most of the hypothesis generation proceeds
in  a  fairly  fixed  order,  with  the  most  reliable  types  of 
evidence  being  examined  first.  However,  we  are  in  the 
process  of  creating  a  more  flexible  control  structure  using  a 
generalized  agenda  for  ordering  constraint  propagation.  The 
conflict  resolution  mechanism  also  requires  further  work  be¬ 
fore  it  can  be  applied  to  all  the  different  types  of  conflicts 
which  could  occur. 

In  spite  of  the  incomplete  state  of  the  current  system, 
it  is  able  to  carry  out  detailed  interpretations  from  simu¬ 
lated  curve  data  as  shown  in  Figures  2  through  10.  Figure 
2  shows  the  original  picture  taken  over  San  Francisco  air¬ 
port  from  which  the  curve  data  in  Figure  3  was  derived 
by hand. This data also specifies the approximate contrast
across  edges.  In  deriving  this  curve  data  we  referred  to  data 
derived  automatically  by  curve  detection  programs,  and  we 
believe  data  of  at  least  this  quality  will  soon  be  available. 
We have also input the location of the sun, although we
hope to soon be able to derive the illumination direction(s)
automatically from the image. Figure 4 contains a circle
over  the  location  of  each  hypothesis  corresponding  to  the 
unambiguous  continuation  of  a  curve  boundary  (constraint 
8  of  the  previous  section).  It  also  shows  the  places  at  which 
the  input  routines  segmented  the  curve  data.  Figure  5  shows 
the  hypotheses  regarding  curve  terminations  at  a  continuous 
curve  (constraint  3). 

Figure  6  shows  a  dotted  line  at  the  match  between 
each  hypothesized  shadow  discontinuity  and  its  correspond¬ 
ing  edge  discontinuity  in  the  sun  direction.  This  makes  a 
small amount of use of the input data regarding the contrast
across curves (constraint 5), but the major support
for  a  shadow  hypothesis  is  the  existence  of  one  of  these 
shadow  matches  (constraint  6).  We  expect  to  be  able  to 
use  tighter  tolerances  when  using  automatically  generated 
data  than  were  used  with  this  simulated  data,  with  cor¬ 
respondingly  better  results.  Figure  7  shows  curves  which 
have  been  hypothesized  to  be  shadows  as  dotted  lines,  and 
also  the  hypothesized  connections  between  shadow  lines  and 
the  geometric  edges  which  cast  them  (constraint  7).  These 
hypotheses  were  formed  from  the  evidence  of  Figure  6  in 
combination  with  other  projective  constraints  mentioned  in 
the  previous  section  and  hypotheses  regarding  which  edges 
constituted geometric boundaries. The success of this technique
can be shown in the closeup view of the tail section
in Figure 8, in which the complicated outline of the shadow
is correctly interpreted as the intersection of shadows from
several different geometric edges. Further hypotheses are
also formed regarding the planarity of regions, and their
direction  of  occlusion. 

From the results of this interpretive process, it is possible
to construct three-dimensional models of the scene.
The shadow to curve matches provide information on the
relative distance of each geometric edge and shadow surface
from the camera. Even if the sun position is not known, all
measurements are correct relative to some constant factor
and can be made absolute by knowing the correct measurement
for any one part of the image. Figures 9 and 10 show
different views of such a model generated automatically from
the hypotheses described above. Some of the minor errors in
this model are actually artifacts of the display system rather
than being present in the actual hypotheses. The position
of the camera with respect to the ground was given to the
modeling system, but this could be deduced from the image
if the orientation of any image object with respect to the
ground was known. With our digitized data it was impossible
to see into shadows on the ground because of digitizing
limitations, even though this information was available in
the original negative. Interpretation would have been easier
with more complete data.

Fig 3 Original digitized image.

Fig 5 Terminations at continuous curves.

Fig 8 Closeup view of tail section shadow hypotheses.

There  is  much  more  information  available  from  the 
curve data than we have used in this example, and we hope
to have better results soon. It is fairly easy to deduce the
depth of objects suspended above the ground from the position
of shadows which are cast underneath them, and we
expect to be able to infer the cross-sections of objects in
many cases from other clues. The interaction of these interpretations
with knowledge of the specific objects in the
scene  is  an  important  area  for  exploration. 
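
As a simplified illustration of the shadow-to-height relation, consider the special case of a vertical view of flat ground with a distant sun. This geometry is an assumption made for the example and is not the general perspective case handled by the system; the function names are hypothetical. A point at height h above the ground then casts its shadow at a ground distance of h / tan(e) from the point directly below it, where e is the elevation angle of the sun.

```python
import math


def height_above_ground(shadow_offset, sun_elevation_deg):
    """Height of a point from the ground distance to its shadow.

    Assumes a vertical view of flat ground and a distant light source.
    """
    return shadow_offset * math.tan(math.radians(sun_elevation_deg))


def relative_heights(shadow_offsets):
    """Heights up to one unknown scale factor, for an unknown sun elevation."""
    largest = max(shadow_offsets)
    return [offset / largest for offset in shadow_offsets]


wing_height = height_above_ground(10.0, 45.0)   # offset 10, sun at 45 degrees
ratios = relative_heights([10.0, 5.0, 2.5])
```

The second function reflects the observation made earlier that, without the sun position, measurements are correct up to a constant factor: the ratios of the shadow offsets already give the ratios of the heights.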

Conclusions 

The  reasoning  system  used  for  this  interpretation  task 
can  be  summarized  as  follows:  we  reason  forward  on  the 
assumption that no coincidences have occurred, but always
maintain enough information so that hypotheses can be
reevaluated and reversed in the face of new contextual evidence.
This gives us the ability to use constraints which
will sometimes be false, and therefore allows us to use
many sources of information which would be unacceptable
in a strictly deductive inference or constraint system. This
method relies on the large amount of redundant information
usually available for vision tasks. The further development
of tools for reasoning with suggestive evidence is
an important goal. Current programming techniques lack
the flexibility needed to apply constraints of this sort in an
optimal manner for different classes of input data.

One  objective  of  this  research  has  been  to  show  that 
image  intensity  discontinuities  carry  far  more  information 
than  has  often  been  recognized  in  the  past.  The  ability  to 
precisely  localize  features  is  of  great  importance  for  provid¬ 
ing  strong  constraints  on  interpretation.  Not  only  is  it 
valuable to localize intensity discontinuities as accurately as
possible,  but  it  is  important  to  localize  curve  terminations, 
intersections,  and  tangent  or  curvature  discontinuities  ac¬ 
curately.  The  constraints  on  curve  interpretations  do  not 
need  to  rely  on  restrictive  assumptions  about  the  class  of 
objects in the image.

More weakly localizable information, such as shading,
intensity, and color, while being very important, is
usually dependent on other assumptions regarding surface
continuity. We believe that it is valuable to first examine the
intensity discontinuities and carry out interpretation similar
to that described here as a step towards using these other
sources of information.

Shadows  are  not  troublesome  features  to  be  removed 
from an image, but are in fact one of the most reliable
sources of low-level information. While some of our constraints
on shadows depend on properties of the sun and
assume a known location of the light source, we believe
that human perception in general makes use of the more
qualitative aspects of our constraints on the interpretation
of illumination. In particular, the behavior of illumination
boundaries as they cross surface markings and geometric
boundaries and terminate at geometric vertices provides a
great deal of information even for unknown conditions of
illumination. The identification of illumination boundaries
can be carried out at a low level by noting the width of
boundary transitions, constraints on contrast changes along
a shadow edge, and the crossing of continuous curves.

ABSTRACT

A simple mathematical formalism is presented
suggesting a mechanism for computing relative
depth of any two texture elements characterized
by the same relative motion parameters. The
method is based on a ratio of a function of the
angular velocities of the projecting rays
corresponding to the two texture elements. The
angular velocity of a ray cannot, however, be
computed directly from the instantaneous
characterization of motion of a "retinal" point. It is
shown how it can be obtained from the (linear)
velocity of the image element on the projection
surface and the first time derivative of its
direction vector. A similar analysis produces
a set of equations which directly yield local
surface orientation relative to a given visual
direction. The variables involved are scalar
quantities directly measurable on the projection
surface but, unlike the case of relative depth,
the direction of (instantaneous) motion has to be
computed by different means before the method can
be applied. The relative merits of the two
formalisms are briefly discussed.
1. INTRODUCTION

Optical flow is the distribution of angular
velocities of projecting rays due to relative
motion of objects with respect to the observer.
Conceptually, optical flows undoubtedly carry a
wealth of information about the spatial arrangement
of the viewed scene, and prominent psychologists
such as Gibson (1950, 1979) have argued forcefully
for the predominant role of this information in
human vision. Discontinuities in the distribution
of angular velocities have been shown to directly
correspond to occluding (or self-occluding) edges
(Nakayama and Loomis, 1974). Corresponding
discontinuities in "retinal" motions thus offer
powerful information for segmentation purposes. Some
recent work has illuminated some of the
relationships between variables directly involved in the
formation of optical flows (Koenderink and van
Doorn, 1975, 1976, 1977; Longuet-Higgins and
Prazdny, 1980; Prazdny, 1980).

The purpose of this paper is to present a
mathematical analysis of some relations containing
information about spatial dispositions of a set
of texture elements. Using the concept of polar
projection as the model for the physical
image-forming process we show that the relative depth
of two texture elements can be computed as a
simple ratio. The entities involved are the
angular velocities of the rays through the texture
elements and the center of projection, and the
visual directions of the rays, which are unit
vectors specifying the directions of the rays in
some egocentric reference frame centered at the
center of projection. We show that the angular
velocity at an image location can be obtained
from the image velocity vector and its first time
derivative at that locus.

A slightly different analysis requiring an a
priori knowledge of the direction vector of the
translatory component of the relative motion
leads to an interesting characterization of
local surface orientation.

Underlying our research on the interpretation
of image motions is the assumption that (an
approximation to) the velocity vectors associated
with individual image elements can be obtained
with reasonable accuracy. Some recent research
regarding the computation of "retinal" velocities
directly from the image brightness values (Hadani
et al., 1980; Horn and Schunck, 1980) supports
this assumption. Other promising support comes
from the research on discrete solutions to the
correspondence problem (Ullman, 1979; Barnard and
Thompson, 1980) which relies on matching various
higher-level image "token" structures.

2.  THE  BASIC  IMAGE  FORMING  GEOMETRY 

To begin, let us first consider motions of
the projecting rays independently of any
particular projection surface. Refer to Figure 1. A
texture element P projects, along the ray OP, into
the image element Q on the unit sphere centered at
the center of projection, O. As the object moves
relative to the observer, the ray OP
(instantaneously) rotates about an axis through O, causing
the image element Q to trace a path (T3) on the
unit sphere.

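Consider a unit vector $\hat{\mathbf{Q}}$ determining the
direction of the ray OP at a given instant
<Note 1>. Its velocity is $d/dt(\hat{\mathbf{Q}}) = \dot{\hat{\mathbf{Q}}}$.
$\dot{\hat{\mathbf{Q}}}$ is perpendicular to $\hat{\mathbf{Q}}$. As the ray moves
with angular velocity $\mathbf{A}$, Q moves along T3 with (linear)
velocity $\dot{\hat{\mathbf{Q}}}$ ($= \mathbf{v}'$). The two velocities are related by
$\dot{\hat{\mathbf{Q}}} = \mathbf{A} \times \hat{\mathbf{Q}}$. Later we will show how the velocity $\mathbf{v}''$
of an image element on the projection plane
relates to the velocity $\dot{\hat{\mathbf{Q}}}$ of the unit vector $\hat{\mathbf{Q}}$.
For the moment, it suffices to note that the
equation $\dot{\hat{\mathbf{Q}}} = \mathbf{A} \times \hat{\mathbf{Q}}$ does not determine $\mathbf{A}$ uniquely; it only
constrains $\mathbf{A}$ to lie in the plane a normal of which
is $\dot{\hat{\mathbf{Q}}}$ (see Figure 8 for further explanation).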
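The ambiguity just described can be illustrated with a few lines of Python. This is only a sketch with made-up numbers: the names `q_hat`, `A`, and `A_alt` are illustrative and do not come from the paper. It shows that adding any multiple of the visual direction to the angular velocity leaves the velocity of the unit vector unchanged, and that the angular velocity is perpendicular to that velocity.

```python
def cross(a, b):
    """Vector product of two 3-tuples."""
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])


def dot(a, b):
    """Scalar product of two 3-tuples."""
    return sum(x * y for x, y in zip(a, b))


# Illustrative (assumed) values: a unit visual direction and one
# admissible angular velocity of the ray.
q_hat = (0.0, 0.0, 1.0)
A = (0.2, -0.1, 0.3)

# Velocity of the tip of q_hat on the unit sphere: Q_dot = A x Q_hat.
q_dot = cross(A, q_hat)

# A second angular velocity that differs from A by a multiple of q_hat.
A_alt = tuple(a + 5.0 * q for a, q in zip(A, q_hat))
q_dot_alt = cross(A_alt, q_hat)

# Both angular velocities produce the same velocity of Q.
max_diff = max(abs(x - y) for x, y in zip(q_dot, q_dot_alt))

# A lies in the plane whose normal is Q_dot.
a_dot_qdot = dot(A, q_dot)
```

The component of the angular velocity along the ray itself is therefore invisible in the instantaneous velocity of Q, which is why additional (for example temporal) information is needed later in the paper.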

The motions of the object and/or the observer
are relative, i.e., we can only resolve the motion
of the object relative to the observer (or vice
versa). The (instantaneous) motion of an object
with respect to the observer (in the reference
frame in which the observer is stationary) can
always be described as a rotation (with some
angular velocity $\mathbf{A}_R$) superimposed on a translation
specified by a vector $\mathbf{v}$. The axis of rotation can
be chosen, without loss of generality, to pass
through the center of projection (Chasles' theorem)
to remove the ambiguity (Whittaker, 1944). The
total (linear) velocity of an environmental point
p (with position vector $\mathbf{P}$) is then
$\mathbf{V} = \mathbf{v} + \mathbf{A}_R \times \mathbf{P}$. An
equivalent expression in terms of the angular
velocity of the projecting ray OP is

(1)  $\mathbf{A} = \mathbf{A}_T + \mathbf{A}_R$

where $\mathbf{A}_T$ is the angular velocity due to the
translation alone, and $\mathbf{A}_R$ is the rotational component
(note that the values of $\mathbf{A}_R$ and $\mathbf{v}$ specify relative
motion, and not, in any sense, the actual 3D motion
of the object). The simplicity of equation (1)
results from the fact that angular velocities about
a common point add vectorially (Weatherburn, 1965).
Observe that $\mathbf{A}_R$ does not vary from point to point
on a rigid body; it is a property of the body as
a whole and independent of distance or visual
direction. $\mathbf{A}_T$, on the other hand, is a function
of visual direction and the distance of the
texture element to the center of projection (see
below). Of these two component fields, only $\mathbf{A}_T$
carries information about relative distance. We
will not derive an expression for $\mathbf{A}_T$ here. The
reader is referred, for example, to Nakayama and
Loomis (1974) or Prazdny (1980) for a detailed
discussion. Briefly, when the object translates
relative to the observer, individual rays of
projection each move in a plane (different
for different rays). The direction of $\mathbf{A}_T$ is
normal to this plane spanned by the vectors $\mathbf{v}$ and
$\hat{\mathbf{Q}}$. The magnitude of $\mathbf{A}_T$ is equal to $d\beta/dt$, where $\beta$
is the angle between the direction of translation
and the given ray (see Figure 2). $\mathbf{A}_T$ is then given
by

(2)  $\mathbf{A}_T = \dfrac{d\beta}{dt}\,\dfrac{\hat{\mathbf{v}} \times \hat{\mathbf{Q}}}{\sin(\beta)} = \dfrac{v}{S}\,(\hat{\mathbf{v}} \times \hat{\mathbf{Q}})$

where $S = \|\mathbf{P}\|$ is the distance of a given texture
element to the center of projection.
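As a small numerical illustration of equation (2), the sketch below evaluates the translational component for one ray and compares its magnitude with $v\sin(\beta)/S$. The numerical values and the variable names (`v`, `q_hat`, `S`) are assumptions chosen for the example only.

```python
import math


def cross(a, b):
    """Vector product of two 3-tuples."""
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])


def dot(a, b):
    """Scalar product of two 3-tuples."""
    return sum(x * y for x, y in zip(a, b))


def norm(a):
    """Euclidean length of a 3-tuple."""
    return math.sqrt(dot(a, a))


# Assumed example values: translation vector, unit visual direction, distance.
v = (0.0, 0.0, 2.0)
s = 1.0 / math.sqrt(2.0)
q_hat = (s, 0.0, s)
S = 4.0

# Equation (2): A_T = (v / S) (v_hat x Q_hat), which equals (v x Q_hat) / S.
A_T = tuple(c / S for c in cross(v, q_hat))

# Angle between the direction of translation and the ray.
beta = math.acos(dot(v, q_hat) / norm(v))

# The magnitude of A_T should equal v sin(beta) / S.
expected_magnitude = norm(v) * math.sin(beta) / S
magnitude_error = abs(norm(A_T) - expected_magnitude)
```

In this example the ray lies 45 degrees from the direction of translation, so the magnitude is $2\sin(45^\circ)/4$, and the vector is perpendicular to both the translation and the ray.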

The observation that enables us to derive an
expression for the relative depth of any two texture
points moving in the same way relative to the
observer concerns equation (1). Consider any two points
$P_i$, $P_j$ on the same object <Note 2>. From equation
(1) it follows that

(3)  $\mathbf{A}_i = \mathbf{A}_{Ti} + \mathbf{A}_R$  and  $\mathbf{A}_j = \mathbf{A}_{Tj} + \mathbf{A}_R$

We see that because $\mathbf{A}_R$ is the same for the two
points it cancels out when the angular velocities
are subtracted:

(4)  $\mathbf{A}_{ij} = \mathbf{A}_i - \mathbf{A}_j = \mathbf{A}_{Ti} - \mathbf{A}_{Tj}$

Using (2) and substituting we obtain


(5)  $\mathbf{A}_{ij} = \hat{\mathbf{v}} \times (k_i\hat{\mathbf{Q}}_i - k_j\hat{\mathbf{Q}}_j)$

where $k_i = v/S_i$. We form the scalar product of both
sides of (4) with $(k_i\hat{\mathbf{Q}}_i - k_j\hat{\mathbf{Q}}_j)$ to obtain

(6)  $k_i(\mathbf{A}_{ij} \cdot \hat{\mathbf{Q}}_i) = k_j(\mathbf{A}_{ij} \cdot \hat{\mathbf{Q}}_j)$

This is because the scalar triple product
involving one vector twice is always zero. We set
$a_i = \mathbf{A}_{ij} \cdot \hat{\mathbf{Q}}_i$ and $a_j = \mathbf{A}_{ij} \cdot \hat{\mathbf{Q}}_j$
and substitute back into (6):

$a_i/a_j = k_j/k_i = S_i/S_j$

In other words, the relative depth of any two
points having the same relative motion parameters
($\mathbf{v}$ and $\mathbf{A}_R$) is computable as a simple ratio.
Observe that (6) does not involve $\mathbf{A}_R$ or $\hat{\mathbf{v}}$; relative
depth can be computed independently of relative
motion. This is a simple but important finding.
In general, all mathematical formalisms for
computing surface orientation or 3D structure from
motions on the projection surface depend
(implicitly or explicitly) on computing the rotational
component of the relative motion first (e.g.,
Ullman, 1979; Longuet-Higgins and Prazdny, 1980;
Prazdny, 1980). To obtain $\mathbf{A}_R$, one has to solve
a set of non-linear equations whose coefficients
are the velocity vector components of at least
five neighboring image elements (Prazdny, 1980),
or use the first and second spatial derivatives
of the image velocity field to obtain enough
information to solve for $\mathbf{A}_R$ and $\hat{\mathbf{v}}$ directly as an
integral step in computing the local surface
orientation (Longuet-Higgins and Prazdny, 1980).
Ullman's scheme (Ullman, 1979), which uses
orthogonal projection and relies on a theorem from
affine geometry (the structure-from-motion theorem
<Note 3>), uses not only spatial information
(mutual position of a set of image elements) but also
temporal information (the relative position of a
given image element in successive snapshots) to
recover the rotation of the configuration prior to
the computation of relative depth [see Meiri
(1980) for some comments on the number of image
points and snapshots necessary to solve for the
relative motion parameters]. Of course, the
assumption of orthogonal projection works only for
some situations.

In contrast, equation (6) above only requires
that the angular velocities ($\mathbf{A}_i$, $\mathbf{A}_j$) at two visual
directions ($\hat{\mathbf{Q}}_i$, $\hat{\mathbf{Q}}_j$) be known. Unfortunately, the
angular velocity at a "retinal" locus cannot be
computed from the information available at that
locus at an instant. The equation $\mathbf{v}' = \mathbf{A} \times \hat{\mathbf{Q}}$ does not
specify $\mathbf{A}$ uniquely (see also Figure 8). The
angular velocity of a ray through an image point can
be obtained only when some additional information
is available. We show that it can be computed
when the vector specifying the time rate of change
in the direction of the "retinal" velocity is
available.
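The relative-depth ratio can be checked on synthetic data. The following Python sketch assumes the ray angular velocities are already known (here they are generated from equations (1) and (2) with made-up motion parameters); the function names `ray_angular_velocity` and `relative_depth` are hypothetical helpers, not part of the paper.

```python
import math


def cross(a, b):
    """Vector product of two 3-tuples."""
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])


def dot(a, b):
    """Scalar product of two 3-tuples."""
    return sum(x * y for x, y in zip(a, b))


def unit(a):
    """Unit vector in the direction of a."""
    n = math.sqrt(dot(a, a))
    return tuple(x / n for x in a)


def ray_angular_velocity(v, A_R, q_hat, S):
    """Equation (1) with A_T from equation (2): A = (v x Q_hat)/S + A_R."""
    A_T = tuple(c / S for c in cross(v, q_hat))
    return tuple(t + r for t, r in zip(A_T, A_R))


def relative_depth(A_i, A_j, q_i, q_j):
    """Return S_i / S_j from two ray angular velocities and directions."""
    A_ij = tuple(x - y for x, y in zip(A_i, A_j))
    a_i = dot(A_ij, q_i)
    a_j = dot(A_ij, q_j)
    return a_i / a_j


# Assumed relative motion shared by both texture elements.
v = (0.3, -0.2, 1.0)
A_R = (0.05, 0.1, -0.02)

# Two texture elements: visual directions and (unknown to the method) depths.
q_i, S_i = unit((0.2, 0.1, 1.0)), 5.0
q_j, S_j = unit((-0.3, 0.4, 1.0)), 8.0

A_i = ray_angular_velocity(v, A_R, q_i, S_i)
A_j = ray_angular_velocity(v, A_R, q_j, S_j)

ratio = relative_depth(A_i, A_j, q_i, q_j)
```

Note that `relative_depth` never sees `v` or `A_R`. The ratio becomes ill-conditioned when the translation direction and the two visual directions are nearly coplanar, because both scalar products then approach zero.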

To see this consider Figure 3. Due to
(relative) motion of a texture element, the ray
specified by the direction vector $\hat{\mathbf{Q}}$ rotates about O so
that Q moves on the surface of the unit sphere
with some velocity $\mathbf{v}' = \dot{\hat{\mathbf{Q}}}$. To find the instantaneous
plane of rotation of Q it is sufficient to apply
a few concepts from elementary differential
geometry. Observe that $\hat{\mathbf{v}}'$, the unit vector in the
direction of $\dot{\hat{\mathbf{Q}}}$, is the unit tangent to the path
at Q. This means that $\dot{\hat{\mathbf{v}}}'$ is in the direction of
the principal normal of the path at Q. Together, $\hat{\mathbf{v}}'$ and
$\dot{\hat{\mathbf{v}}}'$ span the plane on which lies the circle of
curvature at Q. In other words, the plane a normal
of which is $\hat{\mathbf{v}}' \times \dot{\hat{\mathbf{v}}}'$ (this vector lies in the
direction of the binormal vector at Q) is the plane of
instantaneous rotation of Q, and $\mathbf{A}$, the angular
velocity vector of Q, must lie in the direction of
this vector. Observe that here we bring in temporal
information to obtain the angular velocity vector.
Other kinds of additional information are possible.
In the next section we show, for completeness, how
$\hat{\mathbf{v}}'$ and $\dot{\hat{\mathbf{v}}}'$ relate to "retinal" variables when the
projection surface is a plane. Then we analyze a
method for computing local surface orientation
directly, without computing the relative depth
map first.

3.  COMPUTING  THE  ANGULAR  VELOCITY  OF  A  PROJECTION 
RAY  FROM  "RETINAL"  VARIABLES 

In this section we assume that the projection
surface is a plane at unit distance from O. As
the projecting ray rotates about O with some
angular velocity $\mathbf{A}$, Q moves with velocity $\mathbf{v}'$ along
T3, and $Q\hat{\mathbf{Q}}$, the point at which the ray pierces
the projection plane, moves with velocity $\mathbf{v}''$ along
T2 (see Figure 4). Observe also that T3 is a
perspective transformation of T2 and vice versa.
For example, if T3 is a circle, T2 could be any
conic section. The exact type of the curve will
depend on the mutual orientation of the plane
containing T3 (determined by the direction of the
angular velocity vector $\mathbf{A}$) and PP. As mentioned
above, the angular velocity vector $\mathbf{A}$ is normal to
the instantaneous plane of rotation of the ray OP,
i.e., parallel to the binormal vector. Our first
task is thus to determine the direction of the
binormal vector associated with the motion of Q.
First, we express $\dot{\hat{\mathbf{Q}}} = \mathbf{v}'$ in terms of the velocity of
the image element at a given point, and then we
will find the direction of $\mathbf{A}$ as the vector product
of $\hat{\mathbf{v}}'$ and $\dot{\hat{\mathbf{v}}}'$. We will establish a few interesting
auxiliary relations before proceeding further.

Consider Figure 5. The figure illustrates
the fact that the projection of a segment with
magnitude $v'$, lying along a perpendicular to a given line
$\ell$, onto a line $\ell_2$ which makes an angle $\chi$ with line $\ell$ is

$v'' = v'Q/(\sin(\chi) - v'\cos(\chi))$.

We will refer to this equation as the "radial
projection equation."
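A quick geometric check of the radial projection equation is sketched below in Python. Figure 5 is not reproduced in this text, so the planar layout is an assumption: the given line is taken as the x-axis through O, the perpendicular segment starts at unit distance from O, and the second line passes through the point at distance Q along the x-axis. The helper name `radial_projection` is hypothetical.

```python
import math


def radial_projection(v1, Q, chi):
    """Radial projection equation: length along the second line."""
    return v1 * Q / (math.sin(chi) - v1 * math.cos(chi))


# Assumed example values.
v1 = 0.3    # length of the perpendicular segment at unit distance from O
Q = 2.0     # distance from O to the second line, measured along the x-axis
chi = 1.1   # angle between the second line and the x-axis (radians)

t = radial_projection(v1, Q, chi)

# Point reached on the second line after moving a length t from (Q, 0).
px = Q + t * math.cos(chi)
py = t * math.sin(chi)

# The ray from O through the segment end point (1, v1) must pass through
# (px, py); collinearity with O means the 2D cross product vanishes.
collinearity_residual = py * 1.0 - px * v1
```

If the residual is zero, the point at length `t` along the second line is exactly where the ray through the end of the perpendicular segment meets that line, which is the content of the equation under the assumed layout.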

Next we will establish a rather surprising fact
about the relation between the velocity of a point
and the velocity of its projection. We will first
consider, for simplicity, only planar motions,
but the relation holds for space motions too (see
below). Consider Figure 6. Suppose that the point
'a' moves along a circular trajectory C so that its
distance to O remains unchanged. The
(infinitesimal) displacement is $d\varphi$. The displacement of the
projection of 'a' on $\ell_1$ is $\tan(d\varphi)$. The
displacement of the projection of 'a' on $\ell_2$ is
$Q\tan(d\varphi)/[\sin(\chi) - \cos(\chi)\tan(d\varphi)]$
(using the radial projection
equation). To compute the relation between the
velocities on C and $\ell_2$ we divide by $dt$ and take the
limit as $dt \to 0$:

(7)  $\lim_{dt \to 0}\left[\dfrac{Q\tan(d\varphi)}{\sin(\chi) - \cos(\chi)\tan(d\varphi)}\,\dfrac{1}{dt}\right] = \dot{\varphi}\,\dfrac{Q}{\sin(\chi)}$

This is because $\varphi(t)$ is a continuous function of t,
and $\lim[\tan(x)/x] = 1$ as $x \to 0$. This means that when
a point 'a' moves along a path with speed $v'$, its
projection on the line $\ell_2$ moves with speed

(8)  $v'' = Qv'/\sin(\chi)$

In other words, the velocity of the projection of a
point moving along a curve is not the projection
of the velocity with which the point moves along
that curve <Note 10>. This relation holds also in
3D space <Note 4>. Next we will derive the
equation which will enable us to express the angular
velocity in terms of the image velocities.
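The limit that leads to equation (8) can be approximated numerically with a small time step. The Python sketch below uses assumed values for $Q$, $\chi$, and the angular speed; `projected_displacement` is a hypothetical helper that simply evaluates the displacement expression quoted in the text.

```python
import math


def projected_displacement(dphi, Q, chi):
    """Displacement on the second line for an angular displacement dphi."""
    t = math.tan(dphi)
    return Q * t / (math.sin(chi) - math.cos(chi) * t)


# Assumed example values.
Q = 2.0          # distance factor of the projection point
chi = 1.1        # angle between the ray and the second line (radians)
phi_dot = 0.7    # angular speed of the point on the unit circle
dt = 1e-6        # small time step standing in for the limit dt -> 0

# Finite-step approximation of the speed of the projection.
numeric_speed = projected_displacement(phi_dot * dt, Q, chi) / dt

# Equation (8) with v' = phi_dot on the unit circle.
limit_speed = Q * phi_dot / math.sin(chi)

relative_error = abs(numeric_speed - limit_speed) / limit_speed
```

The two speeds agree up to a relative error of the order of `phi_dot * dt`, as expected from the first-order term dropped in the limit.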

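Refer to Figure 4. As Q moves along T3, its
projection $Q\hat{\mathbf{Q}}$ on PP moves with velocity

(9)  $\mathbf{v}'' = d/dt(Q\hat{\mathbf{Q}}) = \dot{Q}\hat{\mathbf{Q}} + Q\dot{\hat{\mathbf{Q}}} = \dot{Q}\hat{\mathbf{Q}} + Q\mathbf{v}'$

along T2. Observe that $\mathbf{v}''$, $\mathbf{v}'$, and $\hat{\mathbf{Q}}$ all lie in the
same plane <Note 4>. This means, however, that the
following two vector equations hold simultaneously:

(10)  $\hat{\mathbf{v}}' \times \hat{\mathbf{Q}} = (\hat{\mathbf{v}}'' \times \hat{\mathbf{Q}})/\sin(\chi)$  and  $\hat{\mathbf{v}}' \cdot \hat{\mathbf{Q}} = 0$

We can now solve for $\hat{\mathbf{v}}'$ from these two simultaneous
vector equations <Note 5> to obtain

(11)  $\hat{\mathbf{v}}' = \hat{\mathbf{Q}} \times (\hat{\mathbf{v}}'' \times \hat{\mathbf{Q}})/\sin(\chi) = \csc(\chi)\hat{\mathbf{v}}'' - \cot(\chi)\hat{\mathbf{Q}}$

This is the main equation. Note that $\hat{\mathbf{v}}'$ is the
unit tangent to T3 at Q. Its first time derivative,
$\dot{\hat{\mathbf{v}}}'$, thus lies along the principal normal to T3 at Q.
Their vector product in turn specifies the direction
of the sought angular velocity vector $\mathbf{A}$.
Differentiating (11) we obtain

(12)  $\dot{\hat{\mathbf{v}}}' = \csc(\chi)\dot{\hat{\mathbf{v}}}'' - \dot{\chi}\csc^2(\chi)\{\cos(\chi)\hat{\mathbf{v}}'' - \hat{\mathbf{Q}}\} - \cot(\chi)\mathbf{v}'$

Taking the vector product of (12) with $\hat{\mathbf{v}}'$ and using
the relation $\hat{\mathbf{v}}' \times \hat{\mathbf{v}}'' = \cos(\chi)(\hat{\mathbf{v}}' \times \hat{\mathbf{Q}})$ leads to

(13)  $\hat{\mathbf{v}}' \times \dot{\hat{\mathbf{v}}}' = \csc(\chi)(\hat{\mathbf{v}}' \times \dot{\hat{\mathbf{v}}}'') + \dot{\chi}(\hat{\mathbf{v}}' \times \hat{\mathbf{Q}})$

Substituting for $\hat{\mathbf{v}}'$ from (11) and
simplifying, we finally obtain

(14)  $\hat{\mathbf{v}}' \times \dot{\hat{\mathbf{v}}}' = \csc^2(\chi)(\hat{\mathbf{v}}'' - \cos(\chi)\hat{\mathbf{Q}}) \times \dot{\hat{\mathbf{v}}}'' + \dot{\chi}\csc(\chi)(\hat{\mathbf{v}}'' \times \hat{\mathbf{Q}})$

Now $\dot{\chi} = d/dt(\chi)$ can be expressed in terms of $\mathbf{v}''$ and the
relation between the visual direction $\hat{\mathbf{Q}}$ and the
projection plane PP (its unit normal). Using $\hat{\mathbf{v}}'' \cdot \hat{\mathbf{Q}} = \cos(\chi)$
and differentiating we obtain

(15)  $\dot{\chi} = -\csc(\chi)(\dot{\hat{\mathbf{v}}}'' \cdot \hat{\mathbf{Q}} + \hat{\mathbf{v}}'' \cdot \dot{\hat{\mathbf{Q}}}) = -\csc(\chi)(\dot{\hat{\mathbf{v}}}'' \cdot \hat{\mathbf{Q}} + \hat{\mathbf{v}}'' \cdot \mathbf{v}')$

However, $\hat{\mathbf{v}}'' \cdot \mathbf{v}' = (\hat{\mathbf{v}}'\sin(\chi) + \hat{\mathbf{Q}}\cos(\chi)) \cdot \mathbf{v}' = v'\sin(\chi)$,
because $\mathbf{v}' \cdot \hat{\mathbf{Q}} = 0$ by definition. Substituting for $v'$
from (8) yields

(16)  $\dot{\chi} = -\csc(\chi)(\dot{\hat{\mathbf{v}}}'' \cdot \hat{\mathbf{Q}}) - v''\sin(\chi)/Q$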

Equations (14) and (16) indicate that the
direction of the angular velocity vector at a
visual direction $\hat{\mathbf{Q}}$ can be determined completely
once the image velocity vector $\mathbf{v}''$ at the locus on
PP corresponding to the visual direction $\hat{\mathbf{Q}}$, and
the first time derivative of the direction vector
of $\mathbf{v}''$, are available. The only other quantities
entering the equation are $Q$ and $\chi$, expressing the
metrics of the projective system <Note 6>.
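Equations (8) and (11), together with the expressions for $Q$ and $\chi$ in Note 6, can be verified on a synthetic ray. The Python sketch below assumes a projection plane with unit normal `m_hat` at unit distance from O and generates the "true" velocity of the unit vector from an arbitrary angular velocity; all numerical values and variable names are illustrative assumptions.

```python
import math


def cross(a, b):
    """Vector product of two 3-tuples."""
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])


def dot(a, b):
    """Scalar product of two 3-tuples."""
    return sum(x * y for x, y in zip(a, b))


def norm(a):
    """Euclidean length of a 3-tuple."""
    return math.sqrt(dot(a, a))


def unit(a):
    """Unit vector in the direction of a."""
    n = norm(a)
    return tuple(x / n for x in a)


# Assumed set-up: plane normal, visual direction, and a ray angular velocity.
m_hat = (0.0, 0.0, 1.0)
q_hat = unit((0.3, -0.2, 1.0))
A = (0.1, 0.25, -0.15)

# Velocity of the unit vector on the sphere: v' = A x Q_hat.
v1 = cross(A, q_hat)

# Note 6: Q = 1 / (Q_hat . m_hat); its time derivative follows by the chain rule.
Q = 1.0 / dot(q_hat, m_hat)
Q_dot = -dot(v1, m_hat) / dot(q_hat, m_hat) ** 2

# Equation (9): image velocity on the projection plane.
v2 = tuple(Q_dot * q + Q * w for q, w in zip(q_hat, v1))
v2_hat = unit(v2)

# Note 6: cos(chi) = Q_hat . v''_hat (chi lies between 0 and pi).
cos_chi = dot(q_hat, v2_hat)
sin_chi = math.sqrt(1.0 - cos_chi ** 2)

# Equation (8): v'' = Q v' / sin(chi).
err8 = abs(norm(v2) - Q * norm(v1) / sin_chi)

# Equation (11): v'_hat = csc(chi) v''_hat - cot(chi) Q_hat.
v1_hat_pred = tuple(c / sin_chi - (cos_chi / sin_chi) * q
                    for c, q in zip(v2_hat, q_hat))
err11 = max(abs(p - a) for p, a in zip(v1_hat_pred, unit(v1)))

# The image velocity must lie in the projection plane.
in_plane_residual = abs(dot(v2, m_hat))
```

Both residuals vanish to rounding error, which confirms that the direction of the sphere velocity is recoverable from the image velocity direction, $\hat{\mathbf{Q}}$, and $\chi$ alone.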

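Observe that the direction of the vector $\dot{\hat{\mathbf{v}}}''$ is
known as soon as $\hat{\mathbf{v}}''$ is known. Because T2 is a planar
motion, $\dot{\hat{\mathbf{v}}}''$ lies in PP, and is perpendicular to $\hat{\mathbf{v}}''$.
Referring to Figure 7, we see that the direction of
$\mathbf{v}''$, $\hat{\mathbf{v}}''$, is specified by $\hat{\mathbf{v}}'' = \cos(\eta)\hat{\mathbf{x}} + \sin(\eta)\hat{\mathbf{y}}$ where
$\hat{\mathbf{x}}$ and $\hat{\mathbf{y}}$ are a set of mutually perpendicular unit
vectors on PP. $\dot{\hat{\mathbf{v}}}''$ is thus given by

(17)  $\dot{\hat{\mathbf{v}}}'' = \dot{\eta}[-\sin(\eta)\hat{\mathbf{x}} + \cos(\eta)\hat{\mathbf{y}}]$

and the magnitude of $\dot{\hat{\mathbf{v}}}''$ is $\dot{\eta}$.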

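Finally, it remains to specify the magnitude of
$\mathbf{A}$. Observe that $A$, the magnitude of $\mathbf{A}$, is a function
of the mutual orientation of $\mathbf{A}$ and the direction
vector $\hat{\mathbf{Q}}$. This is related to the already mentioned
fact that $\mathbf{v}'$ and $\hat{\mathbf{Q}}$ do not specify $\mathbf{A}$ uniquely (see
Figure 8). Using $\mathbf{v}' = \mathbf{A} \times \hat{\mathbf{Q}}$ we see that
$\mathbf{v}' = v'\hat{\mathbf{v}}' = A(\hat{\mathbf{A}} \times \hat{\mathbf{Q}}) = A\sin(\omega)\hat{\mathbf{v}}'$,
and from this it follows directly that

(18)  $A = v'/\sin(\omega) = v''\sin(\chi)/[Q\sin(\omega)]$

Here, $\omega$ is the angle between $\mathbf{A}$ and $\hat{\mathbf{Q}}$, and $\cos(\omega) = \hat{\mathbf{A}} \cdot \hat{\mathbf{Q}}$.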

4.  COMPUTING  LOCAL  SURFACE  ORIENTATION 

In this section, we analyze a method of
computing local surface orientation relative to a
given visual direction. An interesting result will
be that the directions of angular velocity vectors
are not required explicitly. However, we cannot
obtain something for nothing; the analysis requires
an a priori knowledge of the (instantaneous)
direction of motion (the direction of the
translatory component of the relative motion).

First, let us express vectors in a Cartesian
(rectangular) coordinate frame as a function of
two angles $\alpha$ and $\beta$. Then, for a given visual
direction $\hat{\mathbf{Q}}(\alpha,\beta)$, we can compute $\partial\mathbf{A}/\partial\alpha$ and $\partial\mathbf{A}/\partial\beta$.
Using equation (3), it is easy to see that

(19)  $\partial\mathbf{A}/\partial\alpha = \partial\mathbf{A}_T/\partial\alpha$  and  $\partial\mathbf{A}/\partial\beta = \partial\mathbf{A}_T/\partial\beta$.

In other words, the information contained in the
gradient of the angular velocity field in the
given visual direction is equivalent to the
information contained in the gradient of its translatory
component. This is not surprising, for as we saw
above, only the translatory component carries
information about depth relations between the 3D
texture elements.

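Let us find expressions for $\partial\mathbf{A}_T/\partial\alpha$ and $\partial\mathbf{A}_T/\partial\beta$
and analyze them to see how local surface
orientation is specified in these expressions. If $\alpha$ and $\beta$
are chosen such that the vector corresponding to
$\alpha = 0$ and $\beta = 0$ specifies the direction of the
translatory component $\mathbf{v}$, we see that $\hat{\mathbf{A}}_T$ does not change
as we move on the plane $\alpha$ = const. Using this fact
and differentiating (2) with respect to $\beta$ yields

(20)  $\partial\mathbf{A}_T/\partial\beta = (\partial A_T/\partial\beta)\,\hat{\mathbf{A}}_T$

Now, $\dot{\beta} = v\sin(\beta)/S$ (see Figure 2), and $\partial A_T/\partial\beta = \partial\dot{\beta}/\partial\beta$.
Thus,

(21)  $\partial\dot{\beta}/\partial\beta = v\cos(\beta)/S - \dot{\beta}S_\beta/S$  where  $S_\beta = \partial S/\partial\beta$

Multiplying both sides by $\tan(\beta)$, and recalling
that $\dot{\beta} = v\sin(\beta)/S$, we see that

$\tan(\beta)\,\partial\dot{\beta}/\partial\beta = \dot{\beta} - \dot{\beta}\tan(\beta)S_\beta/S$  so that

(22)  $S_\beta/S = -(1/\dot{\beta})[\partial\dot{\beta}/\partial\beta] + \cot(\beta)$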
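Equation (22) can be tested on a synthetic depth profile. The Python sketch below assumes a pure translation and a hypothetical surface whose distance within a plane $\alpha$ = const. is $S(\beta) = d/\cos(\beta - \theta)$ (a straight line at perpendicular distance `d`), for which $S_\beta/S = \tan(\beta - \theta)$ is known in closed form. The derivative of $\dot{\beta}$ is approximated by a central difference.

```python
import math

# Assumed example values: speed, perpendicular distance, and line orientation.
v, d, theta = 1.5, 3.0, 0.4


def S(beta):
    """Distance to the hypothetical surface along eccentricity beta."""
    return d / math.cos(beta - theta)


def beta_dot(beta):
    """Angular speed due to pure translation: v sin(beta) / S."""
    return v * math.sin(beta) / S(beta)


beta = 0.9   # eccentricity at which the orientation is evaluated
h = 1e-6     # step for the central difference

# Numerical estimate of the derivative of beta_dot with respect to beta.
d_beta_dot = (beta_dot(beta + h) - beta_dot(beta - h)) / (2.0 * h)

# Equation (22): S_beta / S recovered from "measurable" quantities only.
recovered = -d_beta_dot / beta_dot(beta) + 1.0 / math.tan(beta)

# Closed-form value for the assumed profile.
exact = math.tan(beta - theta)
```

The recovered value does not depend on `v` or `d`, which illustrates the depth-invariant character of the quantity $S_\beta/S$.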

To derive a similar expression for $S_\alpha/S$ requires
a little more ingenuity. Differentiating equation
(2) with respect to $\alpha$ yields

(23)  $\partial\mathbf{A}_T/\partial\alpha = -\sin(\beta)\,[v/S]\,[S_\alpha/S]\,\hat{\mathbf{A}}_T + [v/S]\,(\hat{\mathbf{v}} \times \partial\hat{\mathbf{Q}}/\partial\alpha)$

But we have also

(24)  $\partial\mathbf{A}_T/\partial\alpha = \partial(A_T\hat{\mathbf{A}}_T)/\partial\alpha = (\partial A_T/\partial\alpha)\hat{\mathbf{A}}_T + A_T\,\partial\hat{\mathbf{A}}_T/\partial\alpha = (\partial\dot{\beta}/\partial\alpha)\hat{\mathbf{A}}_T + \mathbf{X}$

where $\mathbf{X}$ is some vector which does not have to be
specified in detail (for our purposes). It can be
seen immediately that

(25)  $\partial\dot{\beta}/\partial\alpha = -[S_\alpha/S]\,v\sin(\beta)/S$

and from this $S_\alpha/S$ is specified as

(26)  $S_\alpha/S = -(1/\dot{\beta})\,\partial\dot{\beta}/\partial\alpha$

The quantities $S_\beta/S$ and $S_\alpha/S$ are depth invariant
characterizations of (local) surface orientation
relative to a particular visual direction $\hat{\mathbf{Q}}(\alpha,\beta)$
<Note 8>. In fact, they are directly related to
the gradient of the distance. Because of the
dependence of this specification of local surface
orientation on a particular visual direction
(defining the surfaces of constant $\alpha$ and $\beta$), two
different surface orientations cannot be directly
compared. To do so, one could transform one
characterization into another using a simple
rotation matrix. It is important to realize that the
expressions characterizing the (local) surface
orientation in a pure translatory situation hold
also in a general situation of a curvilinear motion.
The only prerequisite is (as in the pure
translation case) that the direction of the instantaneous
motion (i.e., the direction of the translatory
component of the curvilinear motion) is known.

The knowledge of $\hat{\mathbf{v}}$ allows us to define, for each
"retinal" locus, a direction along which $\alpha$ = const.
By projecting the "retinal" velocity vector into
this direction we obtain $\dot{\beta}$, and by differentiating
these "retinal" velocities along the directions
$\alpha$ = const. and $\beta$ = const. (see Figure 10) we obtain
$\partial\dot{\beta}/\partial\alpha$ and $\partial\dot{\beta}/\partial\beta$.

If this process is to be carried out on the
projection plane, the best thing to do is to locate
the focus of expansion which then determines the
lines $\alpha$ = constant as the lines joining the focus of
expansion and the particular "retinal" locus. The
localization of the focus of expansion (FOE) is a
difficult task. One may try to decompose the image
velocity field there into its constituents. The
rotational component of the angular velocity vector
at a given instant can be decomposed into two image
velocity fields: one, a hyperbolic field (the
component vectors being tangent to hyperbolas
through given loci) due to rotation about an axis
through the center of projection parallel to the
projection plane, and the other, a circular field
(the component vectors being tangent to circles
with centers at the center of the image coordinate
system) due to rotation about the normal to the
projection plane. The remaining translational
component would consist of vectors defining the
focus of expansion (possibly at infinity) as the
unique intersection of all straight lines defined
by these vectors and the corresponding "retinal"
loci. The whole process is essentially a constrained
minimization problem of a function of three
variables: the magnitude of the circular component,
the direction of the hyperbolic field (specified
by a single angle), and the magnitude of the
hyperbolic field (see Figure 11). The constraining
condition is that the straight lines of the
translational component meet at FOE. Note that such a
process, if successful, would essentially recover
the translational as well as the rotational
component of the angular velocity vector $\mathbf{A}$ (see
equation (1)). We are currently trying to solve
this problem using a relaxation scheme <Note 9>.
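For the much simpler special case in which the rotational components have already been removed, the focus of expansion can be located by ordinary least squares. The Python sketch below is not the relaxation scheme mentioned in the text; it is a plain least-squares line intersection, and it assumes a finite FOE. The function name `estimate_foe` and the sample data are hypothetical.

```python
import math


def estimate_foe(points, flows):
    """Least-squares intersection of the lines through each image point
    along its (purely translational) image velocity.

    Minimizes the sum of squared perpendicular distances from the
    estimated focus of expansion to every line. Assumes a finite FOE
    and at least two non-parallel flow vectors.
    """
    n11 = n12 = n22 = b1 = b2 = 0.0
    for (px, py), (fx, fy) in zip(points, flows):
        length = math.hypot(fx, fy)
        dx, dy = fx / length, fy / length
        # Projector onto the normal of the line: I - d d^T.
        m11, m12, m22 = 1.0 - dx * dx, -dx * dy, 1.0 - dy * dy
        n11 += m11
        n12 += m12
        n22 += m22
        b1 += m11 * px + m12 * py
        b2 += m12 * px + m22 * py
    det = n11 * n22 - n12 * n12
    # Cramer's rule for the symmetric 2x2 normal equations.
    return ((b1 * n22 - b2 * n12) / det, (n11 * b2 - n12 * b1) / det)


# Synthetic, noise-free translational field expanding from an assumed FOE.
true_foe = (0.2, -0.1)
points = [(1.0, 1.0), (-1.0, 0.5), (0.5, -1.0), (-0.8, -0.7)]
flows = [(px - true_foe[0], py - true_foe[1]) for px, py in points]

foe_estimate = estimate_foe(points, flows)
```

When the translational vectors are nearly parallel (FOE close to infinity) the 2x2 system becomes ill-conditioned, which is one face of the difficulty described in Note 9.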

5.  CONCLUSION 

In this paper we have shown that the relative
depth of any two texture elements moving in the
same way relative to the observer can be computed
in a simple way from the angular velocities of the
corresponding rays of projection. We have
illustrated how the required angular velocity at a point
on the planar projection surface can easily be
computed from the (linear) velocity of the image
element at that locus, and the first time
derivative of its direction vector. We have also shown
that local surface orientation can be obtained
rather straightforwardly once the direction of the
translatory component of the relative motion is
known. The recovery of this direction from the
information contained in the distribution of the
"retinal" velocities is a rather complicated task.
It is hoped that it may be possible to decompose
the instantaneous velocity field on the projection
plane into its constituents using a relaxation
process. Some work on this problem is currently in
progress in our laboratory.

The  applicability  of  the  method  will  depend  on 
the  accuracy  with  which  the  image  velocities  can 
be  obtained.  It  remains  to  be  specified  how  these 
errors  will  propagate  through  the  equations  and 
affect  the  accuracy  of  the  computed  relative  depth 
and  local  surface  orientation. 

The computation of relative depth and local
surface orientation were presented as two distinct
processes. This does not have to be so. Local
surface orientation may be obtained from a relative
depth map, for example, by simply fitting a plane
to a set of relative depth values in a given (small)
neighborhood. I believe that the relative depth
map is practically a much more useful construct than
local surface orientation. Because the available
data are noisy, the computation of local surface
orientation relying on quantities obtained in a
small neighborhood of a "retinal" point is likely
to be affected much more than the relative depth of
two widely separated points, where the two angular
velocities can be obtained much more precisely,
e.g., by averaging over a small neighborhood.
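The plane-fitting step mentioned above can be sketched as an ordinary least-squares fit. The Python example below assumes that relative depth samples `z` are available at image-plane coordinates `(x, y)` in a small neighborhood and fits `z = a*x + b*y + c`; the parameterization, the helper names, and the sample data are assumptions made for illustration, and the recovered slopes are only defined up to the unknown overall depth scale.

```python
def det3(m):
    """Determinant of a 3x3 matrix given as a list of rows."""
    return (m[0][0] * (m[1][1] * m[2][2] - m[1][2] * m[2][1])
            - m[0][1] * (m[1][0] * m[2][2] - m[1][2] * m[2][0])
            + m[0][2] * (m[1][0] * m[2][1] - m[1][1] * m[2][0]))


def fit_plane(samples):
    """Least-squares fit of z = a*x + b*y + c to (x, y, z) samples."""
    sxx = sxy = sx = syy = sy = sxz = syz = sz = 0.0
    for x, y, z in samples:
        sxx += x * x
        sxy += x * y
        sx += x
        syy += y * y
        sy += y
        sxz += x * z
        syz += y * z
        sz += z
    n = float(len(samples))
    normal_matrix = [[sxx, sxy, sx], [sxy, syy, sy], [sx, sy, n]]
    rhs = [sxz, syz, sz]
    d = det3(normal_matrix)
    coeffs = []
    for col in range(3):
        # Cramer's rule: replace one column by the right-hand side.
        replaced = [row[:] for row in normal_matrix]
        for r in range(3):
            replaced[r][col] = rhs[r]
        coeffs.append(det3(replaced) / d)
    return tuple(coeffs)


# Synthetic relative depth values on a 3x3 neighborhood (assumed data).
samples = [(x, y, 0.5 * x - 0.25 * y + 2.0)
           for x in (-1.0, 0.0, 1.0) for y in (-1.0, 0.0, 1.0)]

a, b, c = fit_plane(samples)
```

With noisy depth values the same normal equations average the errors over the neighborhood, which is the practical advantage the text attributes to working from the relative depth map.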
NOTES
<Note 1>

The following notation is used throughout the paper.
If $\mathbf{v}$ is a vector then $\hat{\mathbf{v}}$ is a unit vector in its
direction, and $v$ is the magnitude of $\mathbf{v}$. The scalar
product is denoted by "." and the vector
product by "×". All velocities, position vectors, and
the associated quantities are functions of time.
This is assumed implicitly throughout the paper.
$\dot{\mathbf{v}}$ and $\dot{v}$ denote the time derivatives, as usual.
Angular velocities are vectors perpendicular to
the plane of (instantaneous) rotation, with
magnitude equal to angular speed (radians/sec).

<Note 2>

The texture elements can be on two different objects
as long as the objects move in the same way relative
to the observer (i.e., have the same $\mathbf{v}$ and $\mathbf{A}_R$). Thus,
for example, in the stationary world where the
observer is the only moving agent, the relative depth
of all texture elements can be recovered using the
present method.

<Note 3>

The structure-from-motion theorem states that the
relative depth of four non-coplanar points is
recoverable from three non-degenerate orthographic
projections. The mutual orientation of the
projection planes has to be determined before the actual
relative depth of the four points can be computed.
The recovery of the mutual orientation of the
projection planes is an integral part of the scheme.

<Note 4>

To see this, note that $\mathbf{v}'' = d/dt(Q\hat{\mathbf{Q}}) = \dot{Q}\hat{\mathbf{Q}} + Q\dot{\hat{\mathbf{Q}}}$.
Now $\dot{\hat{\mathbf{Q}}} = \mathbf{v}'$ and $\mathbf{v}' \cdot \hat{\mathbf{Q}} = 0$. Thus $\mathbf{v}'' \times \hat{\mathbf{Q}} = Q(\mathbf{v}' \times \hat{\mathbf{Q}})$, i.e., $\mathbf{v}'$,
$\mathbf{v}''$, and $\hat{\mathbf{Q}}$ all lie in the same plane. Setting $\hat{\mathbf{n}}$ to
be the unit normal of this plane, we have $\mathbf{v}' \times \hat{\mathbf{Q}} = v'\hat{\mathbf{n}}$;
but also $\mathbf{v}'' \times \hat{\mathbf{Q}} = v''\sin(\chi)\hat{\mathbf{n}}$ (see Figure 4). Thus
$\hat{\mathbf{n}} = (\mathbf{v}' \times \hat{\mathbf{Q}})/v' = (\mathbf{v}'' \times \hat{\mathbf{Q}})/v''\sin(\chi)$. Substituting for
$\mathbf{v}'' \times \hat{\mathbf{Q}}$ we obtain $(\mathbf{v}' \times \hat{\mathbf{Q}})/v' = Q(\mathbf{v}' \times \hat{\mathbf{Q}})/v''\sin(\chi)$ and so
$v' = v''\sin(\chi)/Q$, as stated in (8).

<Note 5>

A set of vector equations of the form

$\mathbf{X} \times \mathbf{A} = \mathbf{B}$  and  $\mathbf{X} \cdot \mathbf{A} = p$

where $\mathbf{A} \cdot \mathbf{A} \neq 0$, has a general solution $\mathbf{X} = (p\mathbf{A} + \mathbf{A} \times \mathbf{B})/(\mathbf{A} \cdot \mathbf{A})$.

<Note 6>

Given the unit normal $\hat{\mathbf{m}}$ defining the projection plane
PP, these quantities are computable easily as
$Q = 1/(\hat{\mathbf{Q}} \cdot \hat{\mathbf{m}})$, and $\chi$ is given by $\cos(\chi) = (\hat{\mathbf{Q}} \cdot \hat{\mathbf{v}}'')$.

<Note 7>

As can be seen in Figure 9, $\hat{\mathbf{A}}_T$ is the same for all
$\hat{\mathbf{Q}}(\alpha,\beta)$ on the plane $\alpha$ = const.; it is the unit normal
of this plane. It follows directly that $\partial\hat{\mathbf{A}}_T/\partial\beta = 0$.

<Note 8>

Expressions similar to (22) and (26) were also
independently obtained by Clocksin (1980) using a
different approach.

<Note 9>

While the problem is conceptually rather simple,
there are some difficulties relating to the
formulation of the criterion function to be minimized.
The difficulty is related to the fact that the
projection plane is an augmented Euclidean plane,
in terms of projective geometry. For example,
the translational vector components on the plane
may all meet at an (ideal) point at infinity. It
is rather difficult to incorporate this condition
into a nicely behaving criterion function.

<Note 10>

This result is intuitively rather surprising. It
follows directly, however, from the definition of
the angular velocity (see Figure 2).

Figure 1. A texture point P projects into a point Q on the unit
sphere. The direction vector of the projecting ray OP is determined
by the two angles, $\alpha$, the meridian, and $\beta$, the eccentricity; the
vector $\hat{\mathbf{Q}}$ is a function of $\alpha$ and $\beta$. The plane $\alpha = 0$ and the direction
($\alpha = 0$, $\beta = 0$) are arbitrary, but it is advantageous for the future
analysis to choose them so that the principal x-axis coincides with
the direction vector of the translatory motion component.


Figure 2. The angular velocity of a ray due to a pure translation.
The angular speed is defined as $d\beta/dt$, i.e., as the projection of
$\mathbf{v}$ on the perpendicular to the ray, divided by the distance S.
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Figure  3.  To  compute  the  direction  of  the  angular  velocity  vector 
of  the  ray  specified  by  the  visual  direction  0,  observe  that  Q  moves 
{ on^a  3D  path  on  the  surface  of  the  unit  sphere)  wit)j  velocity 
v'»Q.  Because  v*  is  the  unit  tangent  to  this  path,  v  lies  in  the 
direction  of  the  principal  normal  to  the  path  at  Q.  The  unit  bi¬ 
normal  vector  at  C,  which  is  perpendicular  to  the  plane  spanned  by 
v*  and  V • ,  is  parallel  to  the  angular  velocity  vector  A  of  Q. 




Figure 4. The basic projection geometry. The ray, determined by
its direction vector Q, moves due to the relative motion of the
object with respect to the observer. The point Q" = qQ at which it
pierces the planar projection surface, PP, describes a planar
trajectory T2. The unit vector Q describes a 3D trajectory T3. The angle
χ is the angle between the image velocity v" of Q", the projection
of P onto PP, and the ray OP.


Figure  f>.  The  relation  between  an  infinitesimal  displacement  d<f 
along  the  circle  C,  and  its  projection  on  the  line  ip.  The  speed 
at which b, the projection of a on ip, moves along ip
is not the projection of the speed with which a moves along C.
See  text  for  further  explanation. 

Figure 7. The direction of v" on PP is determined by an angle n.
The first time derivative of v" has direction perpendicular to v",
and magnitude n. x and y are a set of mutually perpendicular unit
vectors on the projection plane PP.

Figure 8. Knowing that the point Q (with position vector Q) moves
with some (linear) velocity v does not specify the angular velocity
of the ray OQ. The equation v = A × Q constrains A to lie in the plane
of which v is a normal. Q could (instantaneously) move on an infinite
number of possible circles of rotation, only three of them being
shown. Observe that A, the magnitude of A, depends
on the angle between Q and A (it determines the radius of the
instantaneous circle of rotation).




Figure 10. For each given visual direction Q(α,β), the circles
α=const., β=const., and Q itself define a rectangular coordinate
system. Observe that while the condition α=const. defines a plane,
β=const. defines a cone with apex at O! The circles C1 on the
plane and C2 on the cone are mutually perpendicular.

[Figure caption, partly illegible] The image velocity v" (on the planar
projection surface PP) of a point Q can be described by three
components. The hyperbolic component h is due to the rotation of the
ray about an axis (through O) in the projection plane. The circular
component c is due to the rotation of the ray about an axis (through O)
parallel to [illegible]. The translational component is [illegible]. In
the illustration above, the direction angle of the hyperbolic field
is zero (measured anticlockwise from the x-axis).

ABSTRACT


A  technique  to  analyze  patterns  in  terms  of 
individual  texture  primitives  and  their  spatial 
relationships  is  described.  The  technique  is 
applied  to  natural  textures.  The  descriptions 
consist  of  the  primitive  sizes  and  their 
repetition  pattern  if  any.  The  derived 
descriptions  can  be  used  for  recognition  or 
reconstruction  of  the  pattern. 

INTRODUCTION 

Areas  of  an  image  are  better  characterized  by 
descriptions  of  their  texture  than  by  pure 
intensity  information.  Texture  is  most  easily 
described  as  the  pattern  of  the  spatial 
arrangement  of  different  intensities  (or  colors). 
The different textures in an image are usually
very  apparent  to  a  human  observer,  but  automatic 
description  of  these  patterns  has  proved  to  be 
very  complex.  In  this  research,  we  are  concerned 
with  a  description  of  the  texture  which 
corresponds,  in  some  sense,  to  a  description 
produced  by  a  person  looking  at  the  image. 

Many statistical textural measures have been
proposed in the past [1-4]. Reference 1 gives a
good review of various texture analysis methods.
Among the statistical measures which have been
discussed and used are analysis of generalized
gray-level co-occurrence matrices [2], analysis of
edge directions with co-occurrence matrices [3],
and analysis of the edges (or micro-edges) in a
subwindow [4]. The statistical methods, by
themselves,  do  not  produce  descriptions  in  the 
form  which  we  desire.  Some  of  the  measures  may 
indicate  certain  underlying  structures  in  the 
pattern,  but  do  not  produce  a  general  description. 
The  Fourier  transform  has  been  used  to  determine 
some  structural  descriptions  but  was  only 
partially successful for more complex patterns.

The work in what can be called structural
texture  description  has  been  more  limited  [6-9]. 
Maleson [6] used simple regions as the basic
elements and used relations between regions and
shape properties of the region in his analysis.
Rosenfeld [7] has proposed a texture analysis
method also based on primitive regions extracted
by a threshold operation. Tamura et al. [8]
tried to develop a set of operators which would
rate textures on several scales, comparable to
their ratings by human subjects. The proposals of
Marr [9] for texture analysis based on the primal
sketch are similar to some of the analysis which
we perform.

This  paper  builds  on  work  reported  earlier 
[10,11]  and  thus  we  will  only  outline  the  early 
processing steps. We will present more detail on
determining  spatial  relations  between  primitives 
and  generation  of  textures  from  the  descriptions. 
Finally,  we  comment  on  the  use  of  the  descriptions 
for  recognition  and  the  computation  of  texture 
gradients. 

SYMBOLIC  DESCRIPTIONS 

The goal of this work is a description of a
texture pattern based on primitive elements and
the spatial arrangement of those elements. For
example, a checkerboard pattern might be
described as approximately square elements,
alternately light and dark in perpendicular
directions. Absolute intensity (color)
information is an unreliable feature for defining
texture primitives. Instead, we have chosen to
analyze edge images to initially determine the
structure.

The edge repetition arrays (ERAs) are similar
to gray level co-occurrence matrices, but differ
in many respects. The important feature which we
wish to extract is the repetitive nature of the
texture patterns. This is apparent when a
sequence of spacings (e.g. 2-32) is considered.
Thus the values which we wish to compute are how
edge elements repeat as a function of distance for
a  given  angle. 
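
The edge repetition computation described in this section can be sketched in Python. This is a minimal illustration rather than the authors' implementation: it assumes a signed edge map for a single edge orientation (+1 and -1 for the two edge polarities, 0 elsewhere), scans along image rows, and normalizes each count by the number of edges that could start a pair at that spacing. The function name `edge_repetition` and the `max_spacing` parameter are hypothetical.

```python
import numpy as np

def edge_repetition(edges, max_spacing=32):
    """Edge repetition values for one edge orientation, scanning along rows.

    edges: 2-D integer array with +1 / -1 for the two edge polarities
           and 0 where there is no edge (an assumed input format).
    Returns (same, opposite): arrays indexed by spacing d, giving the
    fraction of edges that have a same-polarity or opposite-polarity
    edge exactly d pixels further along the scan.
    """
    edges = np.asarray(edges)
    same = np.zeros(max_spacing + 1)
    opposite = np.zeros(max_spacing + 1)
    for d in range(2, max_spacing + 1):
        first = edges[:, :-d]      # candidate first edge of each pair
        second = edges[:, d:]      # pixel d steps further along the scan
        n_first = np.count_nonzero(first)
        if n_first == 0:
            continue
        product = first * second
        same[d] = np.count_nonzero(product == 1) / n_first
        opposite[d] = np.count_nonzero(product == -1) / n_first
    return same, opposite

# Synthetic stripes of period 8: elements of width 3 and width 5.
row = np.zeros(64, dtype=int)
row[3::8] = 1       # one polarity at 3, 11, 19, ...
row[8::8] = -1      # the other polarity at 8, 16, 24, ...
same, opposite = edge_repetition(np.tile(row, (4, 1)))
print(same[8], opposite[3], opposite[5])
```

In this synthetic case the opposite-polarity values respond at the two element widths (3 and 5), while the same-polarity values respond at the period (8), which is the kind of structure the description step looks for.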

We accumulate edge repetition information
(for edges in the same direction, and edges in opposite
directions) for the six edge directions (0°, 30°,
60°, etc.) and for both colors ("light" and
"dark") and for spacings of 2 to 22 or more. The
data for a particular edge direction are
accumulated by scanning the image in a direction
perpendicular to the edge direction. The sums are
normalized to give a value which is the
probability of a pair of edges of the proper
direction occurring at the given angle and spacing
when one edge occurs.

Well defined texture patterns produce obvious
patterns in the ERA, and it is from these patterns
in the ERA that we will derive the description of
the texture pattern. For example, a regular
pattern of dark and light elements will produce an
edge image with regularly spaced edges. The ERA
for opposite edge directions will have a peak
value at the element size and peaks spaced by the
width of the two elements. A pattern created by
the random arrangement of similar elements will
not have significant numbers of edge repetitions
(for opposite edge directions) except for the
distance corresponding to the element width - one
relatively large peak at a small distance and
lower values afterward with no other dominant
distance value. In the data for the same edge
direction the situation is reversed - few
repetitions at small distances and relatively more
at larger distances, but no repetitive structure
in the data.

The symbolic description of the pattern is
derived from the edge repetition arrays. The
basic technique is summarized below:

1) Find, classify and describe the peaks (local
maxima) in the ERAs, i.e., as strong or weak.

2) Examine the element spacing (size of pairs of
elements) data to determine if there are any
peaks which repeat - i.e., occur at multiples
of the first element spacing value.

3) Examine the element size data to determine if
there is support for a spacing value
determined above - i.e., local maxima spaced
at the given distance.

4) Apply a set of heuristics, expressed as
production rules, to generate the final
description.

The description of the texture of Fig. 1 is given
in Fig. 2.

DESCRIPTION OF PRIMITIVES

The basic symbolic description outlined in
Section II above is only a start. There remain
many problems. There is no indication of the
overall shape of the elements, there is no
coordination of description from one direction
with descriptions for other directions, and there
is no way to immediately derive various properties
of the elements. To determine these properties we
must extract all the primitive texture elements
from the image. Earlier methods [6,7] made an
attempt to extract primitives as the first step.
But there are problems when elements are very
small, the intensity levels vary across the image
or there are several different types of elements
(e.g., three with different intensities) which
compose the texture pattern. Therefore, instead
of generating texture elements as a first step, we
look for texture elements with certain known
properties, the properties given in the basic
symbolic description.

There are three strong indications of element
size provided in the texture description given in
Fig. 2. Knowing the exact locations, within the
original texture image, where the edge matches
contributing to these strong peaks occur is
useful. It is then possible to isolate the
uniform intensity regions or textural elements
being measured. Analysis of the set of textural
elements or primitives for a particular
predominant element size then provides the average
intensity, area, shape, etc. for that primitive
group.

In Fig. 2, we can see that all of the
information for a texture is listed according to
relative element intensity and scan direction.
For example, elements of size 3 and spacing 8
occur in the vertical scan direction; and these
elements are dark in relation to their vertical
neighbors. Since the scan direction and relative
element intensity of each predominant element size
is known, the edge pairs exhibiting the properties
can be located.

The points between these pairs of edges serve
as the initial interior points of the texture
elements. These primitive slices are expanded to
form the mask for the particular primitive
elements. The original image and the binary edge
image are both used by the expansion procedure.
The expansion is perpendicular to the scan
direction, i.e., if the initial description is for
the vertical then the expansion is in the
horizontal direction.

The expanded primitives within the mask image
can then be analyzed to determine various
properties of the basic element such as an average
primitive intensity, an average primitive area in
pixels and the average primitive dimension in the
direction perpendicular to the line of scan.

Results generated for the raffia sample of
Fig. 1 are given in Fig. 3. The dark primitive
found in the vertical scan direction corresponds
to block B in the abstract representation of the
texture pattern given in Fig. 4. The light
primitive found in the vertical scan direction
corresponds to block A in Fig. 4 and the light
primitive found by the horizontal scan corresponds
to block C. The first primitive dimension given
in each of the primitive descriptions is the
dimension along the line of scan (i.e., the
direction in which it was described earlier in
Fig. 2) while the second dimension is in a
direction perpendicular to the scan line.
Figure 5 gives the primitive mask images
corresponding to the primitives of types,

SPATIAL RELATIONS OF TEXTURE ELEMENTS

Up to this point the only description
information we have relating to the arrangement of
the texture primitives is the element spacing
information described in Section II. However,
this information pertains to only a single scan
direction. Therefore, we can tell only if a
certain group of primitives is periodic in one
direction and if so, what period is exhibited.
When no element spacing can be found for any of
the elements within a texture, the texture pattern
is assumed to be random. In the event that we are
considering a non-random texture pattern this
spacing information is not sufficient to
characterize the particular placement rules within
the texture. There is other work in the area of
describing textures by relationships between
elements [12,13,14] but these others generally
make more assumptions about the type of input
textures.

Consider the 2 brick patterns in Fig. 6 (a
and b). These patterns are similar in that each
contains light and dark elements which are
rectangular and are arranged in a regular pattern.
The arrangement of the bricks within the patterns
is different, but no evidence of this difference
is given in their primitive descriptions (see
Fig. 7a and b). Both sets of bricks were detected
by scanning vertically, and both exhibit regular
element spacing in this direction. However, a
definitive description of element arrangement is
lacking. This placement information is contained
in the primitive masks and composite images (see
Fig. 8), and is the object of analysis described
here.

Determination  of  Relations 

Individual  primitive  masks  provide  the 
locations  of  all  the  primitives  of  a  particular 
type  within  an  image.  Determining  the  predominant 
placement  rules  exhibited  by  these  elements  can  be 
divided into a number of independent subtasks:


1)  Determine  whether  2  sets  of  primitive 
masks  represent  the  same  textural  elements  by 
combining  all  pairs  of  masks  to  calculate  overlap. 
This  situation  arises  when  a  textural  element  is 
detected  in  more  than  one  scan  direction.  This 
should  give  a  reduced  set  of  primitives. 

2) Determine primitive location (centroid)
and compute relations. Rather than computing
relations by taking all pairs, only the relations
to primitives (of any type) closest to a given
primitive are considered. The closest ones are
located by looking in 12 directions (30° apart)
from the given primitive. Totals are kept for
angle, distance, and the type of pair.

3) Find the predominant relations:
Normalize the values to adjust for the number of
primitives of each type. Replace all points above a
threshold by the sum of their 3x3 neighborhood.
Find the predominant local maxima. This gives a
pair of primitives and an angle and distance
separating them. Figure 9 gives the computed
results for the two brick patterns. This can
serve as a basis for descriptions or it can be
used to reconstruct the texture.
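
The second subtask above (closest primitives in 12 directions, with totals kept by angle, distance and pair type) can be sketched as follows. This is an illustration under assumptions, not the authors' code: centroids are taken as (x, y) pairs with angles measured anticlockwise from the x-axis, distances are rounded to whole pixels, and the function name `nearest_relations` is hypothetical. The normalization, 3x3 summation and peak selection of the third subtask are omitted.

```python
import math
from collections import Counter

def nearest_relations(centroids, types, n_dirs=12):
    """Tally relations between each primitive and its closest neighbor
    in each of n_dirs angular sectors (30 degrees wide for n_dirs = 12).

    centroids: list of (x, y) primitive centers.
    types:     list of primitive type labels, same length as centroids.
    Returns a Counter keyed by (type_from, type_to, angle_deg, distance).
    """
    step = 360.0 / n_dirs
    totals = Counter()
    for i, (xi, yi) in enumerate(centroids):
        nearest = {}                      # sector index -> (distance, j)
        for j, (xj, yj) in enumerate(centroids):
            if i == j:
                continue
            dist = math.hypot(xj - xi, yj - yi)
            angle = math.degrees(math.atan2(yj - yi, xj - xi)) % 360.0
            sector = int((angle + step / 2) // step) % n_dirs
            if sector not in nearest or dist < nearest[sector][0]:
                nearest[sector] = (dist, j)
        for sector, (dist, j) in nearest.items():
            key = (types[i], types[j], int(sector * step), round(dist))
            totals[key] += 1
    return totals

# A 3 x 3 square grid of identical primitives, 10 pixels apart.
grid = [(10 * c, 10 * r) for r in range(3) for c in range(3)]
totals = nearest_relations(grid, ["A"] * len(grid))
print(totals[("A", "A", 0, 10)], totals[("A", "A", 90, 10)])
```

Strong entries at (0°, 10) and (90°, 10), together with their 180° counterparts, are the kind of predominant relations that a subsequent peak search would report for this grid.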

Generation  of  Patterns 

One indication of the completeness of a
texture description is the effectiveness of the
description in regenerating the texture pattern.
The descriptions given in Fig. 9 (and a
description of each element) can be used to
recreate the original texture patterns. Certain
assumptions about the pattern are necessary: it is
a single pattern (composed of all the derived
element types) and the relations between elements
of the same type are the same for all types of
elements (i.e., if a grid is found for placing one
element type, the same grid can be used for all
others - with translation).

The generation process has several steps:

1) Ratings for relations which are the same,
except for a 180° difference in directions,
are combined.

2) The highest rated relation for an element with
itself is chosen (marked in Fig. 9).
By itself, this gives only a one dimensional
pattern, therefore the second best relation
must also be used (also marked in Fig. 9).
Collinear relations are ignored and the rating
is dependent on the distance - relations to
close objects are needed for generation of the
pattern. Thus in Fig. 9b the 150°, 330° pair is
chosen rather than the 180°, 0° pair. These
two placement rules define a basic grid for
the pattern (i.e., place an element at each
grid point).

3) Select the highest rated relation between an
element already in the grid and one not in the
grid (marked with "***" in Fig. 9). Put this
element into the pattern in the given relation
to all those already there.

4) Repeat step 3 for all other elements (the
examples  here  have  only  2  elements,  but  3  or 
more  are  possible) . 
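
The grid construction in the steps above can be sketched in Python. This is a minimal illustration with assumed conventions: each placement rule is an (angle in degrees, distance in pixels) pair, angles are measured anticlockwise from the x-axis, and the function names `rule_vector`, `grid_points` and `place_related` are hypothetical. Rendering the element shapes and intensities is left out.

```python
import numpy as np

def rule_vector(angle_deg, distance):
    """Convert a placement rule (angle, distance) into a (dx, dy) offset."""
    a = np.deg2rad(angle_deg)
    return np.array([distance * np.cos(a), distance * np.sin(a)])

def grid_points(rule1, rule2, width, height, origin=(0.0, 0.0)):
    """Grid points origin + i*u + j*v that fall inside a width x height image,
    where u and v are the offsets of the two chosen placement rules."""
    u, v = rule_vector(*rule1), rule_vector(*rule2)
    basis = np.column_stack([u, v])     # singular if the rules are collinear
    origin = np.asarray(origin, dtype=float)
    corners = np.array([[0, 0], [width, 0], [0, height], [width, height]],
                       dtype=float)
    # Grid coordinates of the image corners bound the (i, j) search range.
    coeffs = np.linalg.solve(basis, (corners - origin).T)
    lo = np.floor(coeffs.min(axis=1)).astype(int)
    hi = np.ceil(coeffs.max(axis=1)).astype(int)
    points = []
    for i in range(lo[0], hi[0] + 1):
        for j in range(lo[1], hi[1] + 1):
            x, y = origin + i * u + j * v
            if 0 <= x < width and 0 <= y < height:
                points.append((x, y))
    return np.array(points)

def place_related(points, rule):
    """Positions of a second element type, translated from the grid by one rule."""
    return points + rule_vector(*rule)

first = grid_points((0, 10), (90, 5), width=30, height=20)
second = place_related(first, (0, 4))
print(len(first), second[0])
```

Two collinear rules make the basis matrix singular, so `np.linalg.solve` raises an error in that case; this mirrors the requirement in step 2 that collinear relations be ignored when choosing the second rule.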

Figure 10 shows the results for these 2
patterns. In the first pattern some of the mortar
area was detected as a background area and thus
the image was initialized with this value. In the
second pattern no background was indicated so the
unfilled area is black. The small vertical mortar
pieces were not located and thus are not included
in the output pattern.

APPLICATIONS  AND  CONCLUSIONS 

We  have  used  these  descriptions  for 
recognition.  In  some  preliminary  experiments 
using  a  decision  tree  classifier  with  the  symbolic 
descriptions,  all  the  test  samples  were  correctly 
identified.  This  was  a  test  using  a  small  number 
of  samples  (it  works  with  large  areas,  not 
individual pixels) and a small set of texture
types, but it does provide an indication of the
robustness of the description process.

This work is continuing with the analysis of
texture gradients including extraction of the
necessary information from real images and
analysis. We have applied this system to a
variety of texture types and it is capable of
producing effective descriptions.

Thesis, to appear.

7. A. Rosenfeld, "Cooperative Computation in
Texture Analysis," in Proc. Image Understanding
Workshop, Los Angeles, Nov. 1979, pp. 52-96.

8. H. Tamura, S. Mori, T. Yamawaki, "Textural
Features Corresponding to Visual Perception," IEEE
Trans. SMC-8, June 1978, pp. 460-473.

Fig. 1. Raffia subwindow.

NUMBERS APPEARING IN PARENTHESES ARE SCALE DEPENDENT


FILENAME = RAFFIA.NR10

DARK  OBJECT  DESCRIPTIONS 

HORIZONTAL  SCAN  DIRECTION 

NO  EVIDENCE  OF  PERIODICITY  OR  PREDOMINANT 
ELEMENT  SIZE 

30°  SCAN  DIRECTION 

NO  EVIDENCE  OF  PERIODICITY 
WEAK  EVIDENCE  OF  PREDOMINANT  ELEMENT  SIZE 
(25-*0) 

VERTICAL  SCAN  DIRECTION 

STRONG EVIDENCE OF PERIODICITY (ELEMENT
SPACING 8.00)

STRONG EVIDENCE OF PREDOMINANT ELEMENT
SIZE (3.00) WITH MODERATE SUPPORT FOR
ELEMENT SPACING (8.00)

RATIO  OF  SIZE  TO  PERIOD  IS  .38 


LIGHT  OBJECT  DESCRIPTIONS 

HORIZONTAL  SCAN  DIRECTION 

NO  EVIDENCE  OF  PERIODICITY 
STRONG  EVIDENCE  OF  PREDOMINANT  ELEMENT  SIZE 
(3.00) 

30°  SCAN  DIRECTION 

NO  EVIDENCE  OF  PERIODICITY  OR  PREDOMINANT 
ELEMENT  SIZE 

VERTICAL  SCAN  DIRECTION 

STRONG  EVIDENCE  OF  PERIODICITY  (ELEMENT 
SPACING  8.00) 

STRONG  EVIDENCE  OF  PREDOMINANT  ELEMENT 
SIZE  (5.00)  WITH  MODERATE  SUPPORT  FOR 
ELEMENT  SPACING  (8.00) 

RATIO  OF  SIZE  TO  PERIOD  IS  .63 


60°,  120°  and  150°  SCAN  DIRECTIONS  FOR  BOTH  LIGHT 
AND  DARK  OBJECTS 

NO EVIDENCE OF PERIODICITY OR PREDOMINANT
ELEMENT  SIZE 


PRIMITIVE  ANALYSIS  FOR  TEXT.  SUPP12  (THRESH  -  10) 

RELATIVE  INTENSITY  IS  DARK  DIRECTION  IS 
VERTICAL 

NUMBER  OF  SAMPLES:  108 

AVERAGE PRIMITIVE DIMENSIONS ARE: (2.00 AND 10.39)

AVERAGE  PRIMITIVE  SIZE  IN  PIXELS  IS:  (20.30) 

AVERAGE  PRIMITIVE  INTENSITY  IS:  (128.79) 

PRIMITIVES  REPEAT  AT  ELEMENT  SPACING:  (8.00) 
IN  ABOVE  MENTIONED  DIRECTION 

RELATIVE  INTENSITY  IS  LIGHT  DIRECTION  IS 
VERTICAL 

NUMBER OF SAMPLES: 109

AVERAGE PRIMITIVE DIMENSIONS ARE: (4.00 AND 9.33)

AVERAGE PRIMITIVE SIZE IN PIXELS IS: (36.94)

AVERAGE  PRIMITIVE  INTENSITY  IS:  (172.35) 

PRIMITIVES  REPEAT  AT  ELEMENT  SPACING:  (8.00) 
IN  ABOVE  MENTIONED  DIRECTION 

RELATIVE  INTENSITY  IS  LIGHT  DIRECTION  IS 
HORIZONTAL 

NUMBER  OF  SAMPLES:  68 

AVERAGE  PRIMITIVE  DIMENSIONS  ARE:  (2.00 
AND  7.88) 

AVERAGE PRIMITIVE SIZE IN PIXELS IS: (15.18)
AVERAGE  PRIMITIVE  INTENSITY  IS:  (190.47) 

NO  EVIDENCE  OF  PERIODICITY 

Fig.  3.  Raffia  primitive  texture  element 
description. 


Fig. 2. Symbolic texture description of raffia.


PRIMITIVE  ANALYSIS  FOR  BRICK 


RELATIVE  INTENSITY  IS  LIGHT  DIRECTION  IS 
VERTICAL 

NUMBER OF SAMPLES: 87 PRIMITIVE NUMBER IS: 1

AVERAGE  PRIMITIVE  DIMENSIONS  ARE:  (2.00 
AND  52.92) 

AVERAGE  PRIMITIVE  SIZE  IN  PIXELS  IS:  (105.39) 

AVERAGE  PRIMITIVE  INTENSITY  IS:  (120.80) 

PRIMITIVES REPEAT AT ELEMENT SPACING: (15.00)
IN ABOVE MENTIONED DIRECTION


RELATIVE  INTENSITY  IS  DARK  DIRECTION  IS 
VERTICAL 

NUMBER OF SAMPLES: 106 PRIMITIVE NUMBER IS: 2

AVERAGE  PRIMITIVE  DIMENSIONS  ARE:  (10.00 
AND  35  3*0 

AVERAGE PRIMITIVE SIZE IN PIXELS IS: (353.14)

AVERAGE  PRIMITIVE  INTENSITY  IS:  (100.59) 

PRIMITIVES  REPEAT  AT  ELEMENT  SPACING:  (15.00) 
IN  ABOVE  MENTIONED  DIRECTION 


Fig.  7a.  Brick  pattern  1  primitive  texture 
element  description. 


Fig.  8a.  Brick  pattern  1  composite 
primitives. 


PRIMITIVE  ANALYSIS  FOR  BRICK  2 


RELATIVE  INTENSITY  IS  LIGHT  DIRECTION  IS 
VERTICAL 

NUMBER  OF  SAMPLES:  39  PRIMITIVE  NUMBER  IS:  1 

AVERAGE  PRIMITIVE  DIMENSIONS  ARE:  (2.00 
AND  145.26) 

AVERAGE PRIMITIVE SIZE IN PIXELS IS: (289.87)

AVERAGE PRIMITIVE INTENSITY IS: (175.44)

PRIMITIVES  REPEAT  AT  ELEMENT  SPACING:  (12.00) 
IN  ABOVE  MENTIONED  DIRECTION 


RELATIVE INTENSITY IS DARK DIRECTION IS
VERTICAL

NUMBER OF SAMPLES: 122 PRIMITIVE NUMBER IS: 2


AVERAGE  PRIMITIVE  DIMENSIONS  ARE:  (8.00 
AND  40.93) 

AVERAGE PRIMITIVE SIZE IN PIXELS IS: (326.57)

AVERAGE  PRIMITIVE  INTENSITY  IS:  (134.25) 

PRIMITIVES  REPEAT  AT  ELEMENT  SPACING:  (12.00) 
IN  ABOVE  MENTIONED  DIRECTION 


Fig. 7b. Brick pattern 2 primitive texture
element description.

Fig. 8b. Brick pattern 2 composite
primitives.


Figure  8. 


BRICK2 PRIMITIVE PLACEMENT RESULTS


BRICK1 PRIMITIVE PLACEMENT RESULTS

Fig. 9a. Brick pattern 1 interprimitive angle
in degrees, distance in pixels and
frequency of occurrence.

Fig. 9b. Brick pattern 2 interprimitive angle in
degrees, distance in pixels and
frequency of occurrence.


Figure  9. 


Figure 10.

Abstract

In  this  paper,  binary  and  gray-level  natural 
textures  are  synthesized  using  several  methods. 
The  quality  of  the  natural  texture  simulations 
depends  on  the  computation  time  for  data 
collection,  computation  time  for  generation,  and 
storage  used  in  each  process.  Many  textures  are 
adequately  simulated  using  simple  models  thus 
providing  a  potentially  great  information 
compression  for  many  applications.  Other  textures 
with  macrostructure  and  nonstationary 
characteristics  require  more  extensive  computation 
to  synthesize  visually  pleasing  results.  Although 
the  success  of  texture  synthesis  is  highly 
dependent  on  the  texture  itself  and  the  modeling 
method  chosen,  general  conclusions  regarding  the 
performance  of  various  techniques  are  given. 


1.  Introduction 


Texture is an important characteristic for the
analysis of many types of images. It is an
important feature for discrimination and
identification of regions in images and, as a
result, many techniques for texture analysis have
been in the area of discrimination. Hence, many
different texture discrimination techniques have
been developed [13]. Most are ad hoc.

Texture  synthesis  has  been  over-shadowed  by 
the  emphasis  placed  on  the  discrimination  problem 
and  its  applications.  Little  work  has  been  done 
in the synthesis area even though numerous
applications  exist.  For  example,  sensors  could 
identify  boundaries  of  textured  image  regions. 
Based  on  statistics  gathered  by  the  sensor,  this 
region  could  be  reconstructed  using  simulation 
techniques  with  little  or  no  loss  of  information. 
The  result  is  excellent  information  compression. 

Texture  synthesis  can  also  be  used  as  a 
texture  analysis  tool  leading  to  a  better 
understanding  of  textures  and  their  perception  by 
humans  as  well  as  improved  methods  of 
discrimination. By carefully controlling the
statistics of a texture in a synthesis process,
visual changes in texture are observed. Thus,
texture  synthesis  methods  allow  researchers  to 


identify and measure the information content of
individual  statistical  measurements.  By 
assembling  these  measurements  and  incorporating 
them  into  a  texture  simulation  process,  statistics 
may  be  measured  from  a  parent  texture  and  used  to 
produce  a  texture  simulation.  The  degree  to  which 
the  parent  and  simulation  are  visually  similar 
indicates  the  value  of  the  statistical 
measurements  and  the  model  used  in  the  simulation 
process.  Given  a  group  of  statistical 
measurements which are proposed to be useful
texture  measures,  the  best  may  be  chosen  based  on 
the  quality  of  the  corresponding  texture 
simulations.  In  this  way,  researchers  are  able  to 
develop  better  discrimination  as  well  as  better 
synthesis  methods. 

2.  Concepts  of  Texture  Synthesis 

Despite  its  importance,  a  precise  definition 
of texture does not exist. Texture is often
considered to be composed of a set of primitives
and their spatial organization. More importantly,
texture usually possesses an invariance property.
It is this invariance property which we will use
to nebulously define texture in this paper. An
observer should detect no visual difference
between one windowed portion of a textured region
and another. Thus texture is also a function of
window size. If a difference over a region is
detected then either the texture is not
homogeneous or a larger window should be used.
Windowing  is  very  important  when  gathering 
statistics  to  be  used  for  texture  discrimination 
or  texture  synthesis. 

The  approach  to  texture  synthesis  used  in 
this paper is outlined in Fig. 1. As a first step
in the synthesis process, statistics are
calculated from measurements taken on a parent
sample  texture.  The  statistics  are  then  used  to 
estimate  model  parameters.  In  the  final  step, 
these  model  parameter  estimates  are  used  to 
generate  a  texture  synthesis. 

All  of  the  digital  images  in  this  paper  are 
512  by  512  pixels.  They  have  either  256  gray 
levels  (continuous  tone)  or  2  gray  levels 
(binary).  The  original  parent  texture  images  in 
this  paper  have  been  chosen  from  an  album  by 
Brodatz [1]. High quality prints obtained from
the photographer were scanned and digitized at the
USC Image Processing Institute.

3.  Statistics  of  a  Texture 

The  terminologies  in  a  portion  of  previous 
texture  work  have  often  been  vague  at  best.  As  a 
result, the terms second-order and third-order
have  been  seriously  twisted  and  misinterpreted 
from  study  to  study.  In  this  paper,  we  will 
attempt  to  suppress  this  confusion  by  carefully 
defining  the  various  terms. 


The stochastic approach toward texture
analysis considers texture fields as samples of
two-dimensional stochastic fields. Assuming that
we are dealing with sampled and quantized imagery,
let $I(n_{i1}, n_{i2})$ denote the random field. Here
$n_{i1}$ and $n_{i2}$ are integers representing coordinates of
points in the image plane. Let $\vec{n}_i$ be the vector
having coordinates $n_{i1}$ and $n_{i2}$ (i.e., $\vec{n}_i = (n_{i1}, n_{i2})$).
Second-order statistics are given by the set of
second-order joint density functions

$$P_{\vec{n}_i, \vec{n}_j}(V_i, V_j) \qquad (1)$$

for all possible vectors $\vec{n}_i$ and $\vec{n}_j$, where $V_i$ and
$V_j$ are the values of the random variables $I(\vec{n}_i)$
and $I(\vec{n}_j)$, respectively. In most texture work and
in all of the work in this paper (except for the
work in section 13) the random field is assumed to
be homogeneous, that is, all orders of probability
densities are invariant through translations.
Thus,

$$P_{\vec{n}_i, \vec{n}_j} = P_{\vec{n}_i + \vec{c},\, \vec{n}_j + \vec{c}} \qquad (2)$$

where $\vec{c}$ is an arbitrary vector constant. As an
example,

$$P(V_1, V_2) = P(V_3, V_4) \qquad (3)$$

where $V_1$, $V_2$, $V_3$ and $V_4$ are as shown in Fig. 2.
In most of our work, dummy values of random
variables (denoted for example by $V_j$) will be used
to label pixels at vector location $\vec{n}_j$.

Given the assumption that a texture field is
homogeneous, the joint density functions $P$ for
all vector separations $\vec{\tau}$ between $\vec{n}_i$ and $\vec{n}_j$ represent the
most complete set of second-order statistics
possible. The statistical expectations of any
functions of these joint density functions are
called second-order statistics. If a pixel is
connected to any of its neighbors on the same row,
that is if we consider neighbors immediately to
the left or right (such as $V_5$ and $V_6$ in Fig. 2)
then their joint density is called a second-order
nearest-neighbor joint density and any statistical
expectations of the joint density are second-order
nearest-neighbor statistics.

Assuming homogeneity of the texture, then

$$P_{\vec{n}_i, \vec{n}_j, \vec{n}_k} = P_{\vec{n}_i + \vec{c},\, \vec{n}_j + \vec{c},\, \vec{n}_k + \vec{c}}$$

for all $\vec{n}_i$, $\vec{n}_j$, $\vec{n}_k$ and an arbitrary vector constant $\vec{c}$. As
an example,

$$P(V_1, V_2, V_3) = P(V_4, V_5, V_6) \qquad (6)$$

in Fig. 3. The statistical expectations of any
function  of  those  third-order  densities  are  called 
third-order  statistics.  All  second-order 
statistics  may  be  derived  from  third-order  joint 
densities. 
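As an illustration of how a second-order joint density could be estimated under the homogeneity assumption, the following is a minimal Python sketch. The function name `second_order_density` and the list-of-lists image layout are assumptions made for this example and do not come from the original study.

```python
from collections import Counter

def second_order_density(image, dr, dc):
    """Estimate P(V_i, V_j) for pixel pairs separated by (dr, dc).

    Homogeneity is assumed, so every pair with the same separation is
    pooled into one histogram regardless of its position in the image.
    """
    rows, cols = len(image), len(image[0])
    counts = Counter()
    for r in range(rows):
        for c in range(cols):
            r2, c2 = r + dr, c + dc
            if 0 <= r2 < rows and 0 <= c2 < cols:
                counts[(image[r][c], image[r2][c2])] += 1
    total = sum(counts.values())
    return {pair: n / total for pair, n in counts.items()}

# Tiny binary checkerboard: horizontal nearest neighbors always differ.
board = [[(r + c) % 2 for c in range(4)] for r in range(4)]
density = second_order_density(board, 0, 1)
```

For the checkerboard, only the pairs (0, 1) and (1, 0) occur, which is the nearest-neighbor case described above with separation (0, 1).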

4. The Stochastic Synthesis Approach

Many early texture studies involved the use
of binary textures generated by one-dimensional
Markov processes. Such work was presented by
Julesz [2], Pollack [3], Purks and Richards [4]
and Garber [5]. In these one-dimensional models a
large vector of pixels was generated line by line
using a set of parameters

    $P(V_{N+1} / V_1, V_2, \ldots, V_N)$    (7)

where $P(A/B)$ represents the probability of A given
B. We will refer to these conditional
probabilities as generation parameters. In the
above notation each $V_i$ represents a generated
pixel which has value 0 (black) or 1 (white).
Thus each pixel value depends on the N pixels
previous to it. A two-dimensional texture image
is then formed by breaking up the large vector of
pixels into shorter strings and stacking them one
on top of the other (see Fig. 4). For large images
this procedure nearly ensures image row
independence (unless N is large), thereby creating
only horizontally oriented textures totally
unsuitable for simulating natural two-dimensional
textures.
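The one-dimensional procedure described above can be sketched as follows. This is only an illustration: the function names, the random seeding of the first N pixels, and the parameter values are hypothetical and are not taken from the studies cited.

```python
import random

def generate_markov_string(p_white, n_prev, length, rng):
    """Generate a binary string in which each pixel depends on the
    previous n_prev pixels.

    p_white maps a tuple of n_prev previous pixels to
    P(next pixel = 1 / previous pixels).
    """
    pixels = [rng.randint(0, 1) for _ in range(n_prev)]  # assumed random start
    while len(pixels) < length:
        context = tuple(pixels[-n_prev:])
        pixels.append(1 if rng.random() < p_white[context] else 0)
    return pixels

def stack_rows(pixels, width):
    """Break the long vector into strings of `width` pixels and stack them."""
    return [pixels[i:i + width] for i in range(0, len(pixels) - width + 1, width)]

# Hypothetical generation parameters for N = 2 (illustrative values only).
params = {(0, 0): 0.1, (0, 1): 0.7, (1, 0): 0.3, (1, 1): 0.9}
rng = random.Random(0)
image = stack_rows(generate_markov_string(params, 2, 64 * 64, rng), 64)
```

Because N = 2 is far shorter than the 64-pixel row length, a pixel is essentially unrelated to the pixel directly above it, which is the row independence noted in the text.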

By allowing N to increase beyond the short
string line length, two-dimensional (vertical and
horizontal) dependence may be induced into the
generating process. A pixel value then depends
not only on the pixels previous to it on the same
line but also on the pixels above it (see
Fig. 5(b)). Thus, textures could be generated as
a time sequence in television raster scan fashion.
In theory, texture dependence could be extended ad
infinitum; however, practical considerations
concerning the actual generation process show us
that $2^N$ generation parameters must be accounted
for. As a possible solution to the storage
problem we can choose to ignore all but N of the
previous pixels in our generation process and we
can allow the pattern of the $V_i$'s to become
flexible. This idea will be discussed in later
sections of this paper.


Similarly, third-order statistics are given
by the set of third-order density functions

    $P_{\vec{n}_i, \vec{n}_j, \vec{n}_k}(V_i, V_j, V_k)$    (4)


Throughout the remainder of this paper, the
set of pixels, $V_i$'s, on which the next pixel,
$V_{N+1}$, depends will be referred to as the "kernel"
of the synthesis process. The pixel $V_{N+1}$ will be
referred to as the "eye" of the kernel.

In the kernel shape of Fig. 6,
the $V_5$ pixel directly depends on only some of the
pixels in its surrounding neighborhood. In this
case, $V_5$ may be generated based on the values of
pixels $V_1$, $V_2$, $V_3$ and $V_4$ but is directly dependent
on no other pixels in the neighborhood. This does
not imply that $V_5$ is not related or correlated
with its other neighbors. In fact, the
relationships between $V_1$, $V_2$, $V_3$, $V_4$ and $V_5$ will
determine other interrelationships.

Given a parent texture we can estimate the
generation parameters needed to generate it in
many ways. Ignoring boundary conditions, linear
unbiased estimates ($\hat{P}$'s) of the P's may be defined
as

    $\hat{P}(V_1, V_2, \ldots, V_N) = \frac{1}{M} \sum_{j=0}^{M-1} \prod_{k=1}^{N} \delta(I(k+j) - V_k)$    (8)

where

    $\delta(I(k+j) - V_k) = 0$ if $I(k+j) \neq V_k$
    $\delta(I(k+j) - V_k) = 1$ if $I(k+j) = V_k$

and $I(i)$ represents the ith element of the
one-dimensional texture string from which the
parameters are to be estimated. Equation (8)
assumes that the $V_i$ are contiguously located in
order along a line. It simply states that in
order to estimate $P(V_1, V_2, \ldots, V_N)$ for a fixed
pattern $V_1, V_2, \ldots, V_N$, all M substrings (samples)
of length N are taken from a parent string of
length M+N-1 and the number of occurrences of the
specific pattern $V_1, V_2, \ldots, V_N$ is counted, then
divided by M. This is equivalent to estimating
the probability density function of a random
variable by the histogram of a set of samples.
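The counting procedure described for Eq. (8) can be sketched in a few lines of Python. The function name `estimate_ngrams` and the example string are assumptions made for illustration only.

```python
from collections import Counter

def estimate_ngrams(string, n):
    """Histogram estimate of P(V_1, ..., V_N) from a 1-D parent string.

    All M = len(string) - n + 1 substrings of length n are tallied and
    the tallies are divided by M, as described for Eq. (8).
    """
    m = len(string) - n + 1
    counts = Counter(tuple(string[j:j + n]) for j in range(m))
    return {pattern: c / m for pattern, c in counts.items()}

# Parent string of length M + N - 1 = 9 with N = 2, so M = 8 samples.
parent = [0, 1, 1, 0, 1, 1, 0, 1, 1]
p_hat = estimate_ngrams(parent, 2)
```

Patterns that never occur in the parent string (here the pattern (0, 0)) simply receive no entry, which corresponds to an estimate of zero.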


This idea of estimating N-grams,
$P(V_1, V_2, \ldots, V_N)$, from a sample parent texture may
be extended to the two-dimensional case. A
histogram of occurrences of each pattern of
$(V_1, V_2, \ldots, V_N)$ is made by passing the
two-dimensional kernel in Fig. 5(b) over the
two-dimensional sample parent image. The tally is
then divided by the total number of sample
patterns observed to obtain $\hat{P}(V_1, \ldots, V_N)$. As was
stated earlier, two-dimensional synthesis is
merely an extension of the one-dimensional case
ignoring boundary conditions of the
two-dimensional image.


The generation parameters of a texture may be
estimated for any given set of N $V_i$'s from a
parent texture using the estimates of the P's from
Eq. (8). These statistics have the property

. V1  =  P(VA . V 

where

    $P(V_{N+1} / V_1, \ldots, V_N) = P(V_1, \ldots, V_N, V_{N+1}) / (P(V_1, \ldots, V_N, 0) + P(V_1, \ldots, V_N, 1))$.    (9)
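A minimal sketch of how Eq. (9) might be applied in two dimensions is given below. The kernel is represented as a list of (row, column) offsets relative to the eye; this representation, the function name, and the choice to skip windows that cross the image border are assumptions made for the example.

```python
from collections import Counter

def estimate_generation_parameters(image, kernel):
    """Estimate P(V_{N+1} = 1 / V_1, ..., V_N) for a binary image.

    kernel is a list of (row_offset, col_offset) pairs relative to the
    eye; windows that would fall outside the image are skipped.
    """
    rows, cols = len(image), len(image[0])
    joint = Counter()
    for r in range(rows):
        for c in range(cols):
            positions = [(r + dr, c + dc) for dr, dc in kernel]
            if all(0 <= rr < rows and 0 <= cc < cols for rr, cc in positions):
                context = tuple(image[rr][cc] for rr, cc in positions)
                joint[context + (image[r][c],)] += 1
    params = {}
    for context in {key[:-1] for key in joint}:
        n0, n1 = joint[context + (0,)], joint[context + (1,)]
        params[context] = n1 / (n0 + n1)  # Eq. (9) with V_{N+1} = 1
    return params

# Vertical stripes: the left and upper neighbors fully determine the eye.
stripes = [[c % 2 for c in range(6)] for _ in range(4)]
params = estimate_generation_parameters(stripes, [(0, -1), (-1, 0)])
```

Working with raw tallies rather than normalized probabilities gives the same ratio, since the common divisor cancels in Eq. (9).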


5.  Kernel  Selection  Using  The  Linear  Model 

We will refer to the $V_i$'s on which the next
pixel, $V_{N+1}$, depends as the kernel of the
generation process. Geometrically speaking, the
$V_i$'s form a kernel "shape" or "pattern" which may
or may not be spatially contiguous. For example,
Fig. 6 shows a generating kernel shape in which
the next pixel depends directly on only some of
its neighbors.

A non-contiguous neighborhood of $V_i$'s is used
as it allows a more parsimonious model for texture
generation to be chosen. An analogy is in simple
linear regression (as defined by Draper [10]) where
independent variables which do not contribute to
the prediction or estimation of the dependent
variable are dropped from the model equation. In
texture generation this allows the model to be
estimated by fewer parameters and makes the
generation-synthesis process more efficient by
reducing the number of computations required.
When generating textures based on N-grams,
reducing N reduces the amount of storage required
for the $g^N$ generation parameters as defined by
Eq. (9), where g is the number of gray levels in
the image and N is the number of elements in the
synthesis kernel. By allowing the kernel of $V_i$'s
on which $V_{N+1}$ depends to be non-contiguous, the
range of dependence in a distance sense is
increased over that which would be allowed with a
contiguous kernel containing the same number of
$V_i$'s. This is very important to obtain the larger
structure apparent in many textures. Reducing the
number of pixels in the model also relieves us
from the complex numerical problems of inverting
matrices of unwieldy size, a necessary step in
linear model parameter estimation discussed later
in this paper. We would, for example, not expect
our $V_{N+1}$ pixel to depend on a pixel $V_i$ where the
spatial separation between $V_{N+1}$ and $V_i$ is large.
If that distance is small, however, we would
expect a large dependence.

The method for choosing the proper
independent variables ($V_i$'s) to be included in the
generation process requires special attention. We
wish to choose the best subset of N variables from
a larger finite neighborhood of T variables, where
N < T. Evaluating such subsets and their
corresponding models requires a criterion.
Texture results for each possible model could be
visually examined and compared and the $V_i$'s of the
model corresponding to the visually most pleasing
result could be chosen. However, $\binom{T}{N}$ model
evaluations must be done using this approach. For
a simple search through T = 40 points with N = 12,
over 5.5 billion models would have to be evaluated!
This approach is therefore impractical and so a
sub-optimal approach which yields a good but not
necessarily the best set of $V_i$'s for our model
must be used.
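The size of the exhaustive search can be checked directly. The comparison with forward selection below assumes that each selection step fits every remaining candidate once, which is an assumption about the bookkeeping rather than a figure from the study.

```python
import math

T, N = 40, 12

# Number of distinct N-point kernels that can be drawn from T candidates.
exhaustive_models = math.comb(T, N)

# Forward selection: at step s there are T - s candidates left to try.
forward_selection_fits = sum(T - step for step in range(N))
```

Here `exhaustive_models` evaluates to 5,586,853,480, while the greedy procedure needs only 414 fits under the stated assumption.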

If we view this problem as one of predicting
a dependent variable, $V_{N+1}$, from a large set of
independent variables, $V_i$'s, then the standard
linear model approaches may be applied. A forward
selection procedure was used to choose the set of
$V_i$'s in the model and it is explained in detail by
Draper and Smith [10]. One at a time, the $V_i$'s
are entered into the linear model equation

    $V_{N+1} = \beta_1 V_1 + \beta_2 V_2 + \cdots + \beta_N V_N + \beta_0$    (10)

At each step, the $V_i$ which minimizes the overall
sum of squares error when the corresponding linear
model is applied to the original input data is
entered into the model. Variables are entered
until either a maximum number is attained or the
magnitude of the corresponding coefficient is
small.
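The forward selection step can be sketched as below. This is a simplified illustration, not the procedure of Draper and Smith [10] in full: it stops only on a maximum number of terms (the small-coefficient stopping rule is omitted), it refits the model from scratch for every candidate, and the helper names `solve`, `fit_sse` and `forward_selection` are hypothetical.

```python
import random

def solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for k in range(col, n + 1):
                m[r][k] -= f * m[col][k]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (m[r][n] - sum(m[r][k] * x[k] for k in range(r + 1, n))) / m[r][r]
    return x

def fit_sse(columns, y):
    """Least-squares fit of y on the columns plus a constant term.

    Returns the sum of squared errors of the fitted model.
    """
    cols = columns + [[1.0] * len(y)]
    gram = [[sum(u * v for u, v in zip(ci, cj)) for cj in cols] for ci in cols]
    rhs = [sum(u * v for u, v in zip(ci, y)) for ci in cols]
    beta = solve(gram, rhs)
    return sum((yk - sum(b * c[k] for b, c in zip(beta, cols))) ** 2
               for k, yk in enumerate(y))

def forward_selection(candidates, y, max_terms):
    """Greedily add the candidate that gives the smallest SSE at each step."""
    chosen = []
    while len(chosen) < max_terms:
        remaining = [i for i in range(len(candidates)) if i not in chosen]
        best = min(remaining,
                   key=lambda i: fit_sse([candidates[j] for j in chosen + [i]], y))
        chosen.append(best)
    return chosen

# Synthetic check: y depends strongly on candidate 1 and weakly on candidate 2.
rng = random.Random(1)
x = [[rng.gauss(0, 1) for _ in range(200)] for _ in range(3)]
y = [2.0 * x[1][k] + 0.5 * x[2][k] + 1.0 for k in range(200)]
selected = forward_selection(x, y, 2)
```

In a texture setting each candidate column would hold the values of one neighborhood pixel position collected over the parent image, and `y` would hold the corresponding eye pixels.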

6. Seeding the Process


When synthesizing a two-dimensional image, a
frame of pixels is needed to seed the process as
shown in Fig. 7. The seeding process may be
handled in a variety of ways. The simplest
approach would be to randomly generate the seed
once for the whole image. In this case the pixel
values in the seed frame of Fig. 7 remain the same
throughout the generation process. Parent texture
data could also be used to seed the generation
procedure. Regardless of the seeding process, all
texture synthesis methods developed in this paper
normally converge to a steady state within 5 to 20
pixels of the border of the image. This was
confirmed by repeated studies of convergence
effects on texture simulations. In most cases,
this narrow region is not noticeable and is
included as a part of the result. In some
critical applications these edges could be thrown
away.

7.  Results 

Once  the  points  for  the  kernel  are  chosen 
based  on  the  linear  model  derived  using  the 
methods  described  in  the  previous  sections, 
estimates  of  the  generation  parameters  for  the 
texture  are  obtained  using  concepts  discussed  in 
section 4. Practical considerations require us to
use binary images (g = 2) and limit N, the number of
pixels in the kernel, to 12 to 18 depending on the
processor storage available, as $2^N$ values must be
stored. These conditional probabilities are then
used  to  generate  each  pixel  along  a  row,  row  by 
row  until  a  complete  two-dimensional  texture  is 
obtained.  For  each  pixel  the  appropriate 
generation  parameter  estimate  is  found  and  a 
uniformly-distributed  pseudo-random  variable  is 
generated.  Based  on  these  two  values,  a  black 
pixel  (0)  or  white  pixel  (1)  is  generated. 
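The raster-scan generation step can be sketched as follows. The randomly generated seed frame, the offset-list kernel representation, and the fallback probability of 0.5 for patterns missing from the parameter table are assumptions made for this example.

```python
import random

def synthesize(params, kernel, rows, cols, border, rng):
    """Raster-scan generation of a binary texture.

    params maps a kernel context tuple to P(next pixel = 1 / context).
    A randomly seeded frame of `border` pixels (top, left and right)
    supplies the contexts needed near the image edges.
    """
    h, w = rows + border, cols + 2 * border
    img = [[rng.randint(0, 1) for _ in range(w)] for _ in range(h)]
    for r in range(border, h):
        for c in range(border, w - border):
            context = tuple(img[r + dr][c + dc] for dr, dc in kernel)
            p_white = params.get(context, 0.5)  # assumed fallback for unseen patterns
            # Compare a uniform pseudo-random value with the generation parameter.
            img[r][c] = 1 if rng.random() < p_white else 0
    return [row[border:w - border] for row in img[border:]]

# Sanity check with a degenerate table: every context yields a white pixel.
kernel = [(0, -1), (-1, 0)]  # left and upper neighbors (hypothetical kernel)
always_white = {(a, b): 1.0 for a in (0, 1) for b in (0, 1)}
texture = synthesize(always_white, kernel, 5, 7, 1, random.Random(0))
```

The kernel offsets must point only to pixels above the eye or to its left on the same row, and must not reach farther than `border` pixels, so that every context pixel has already been generated or seeded.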

In practice, not all of the generation
parameters may be estimated when N is large
because all possible patterns of $V_1, V_2, \ldots, V_N, V_{N+1}$
may not be present in the sample image or there
may be few of them. Smaller samples can cause
inaccurate estimation of the $P(V_{N+1}/V_1, \ldots, V_N)$ as
the variance of our estimate is larger and
therefore the expected error of our estimate is
larger than would be expected with a larger sample
size. In these cases it is important to sum over
the least significant kernel elements and estimate
$P(V_{N+1}/V_1, \ldots, V_N)$ by $P(V_{N+1}/V_i, V_{i+1}, \ldots, V_N)$. In our
study, this was done if the sample size to compute
$P(V_{N+1}/V_1, \ldots, V_N)$ was less than 10. The variable
i is increased until the sample size reaches at
least 10.

Texture simulations using this method are
shown in Figs. 15(b), 16(b) and 17(b). Visually,
the results are very good. As the estimated
texture generation parameters are approximated
using statistics gathered from the full parent
texture, non-homogeneity in the parent texture
will cause an "average" texture to be synthesized.

The bark texture is among the most difficult
to simulate due to its very unusual
macro-structure. Still, the N-gram simulation
looks remarkably similar to the original when
windowed regions 20 to 40 pixels square are
observed. The parent texture of water is
non-homogeneous as the waves are more closely
spaced on one side than on the other. The
synthesis contains waves of an average size. As
we are attempting to synthesize textures and not
merely "image code" the parent textures, details
and non-homogeneities will be lost in the
synthesis process.

8. Linear Model Generation of Binary Textures

The process of choosing the $V_i$'s to be
present in the texture generation kernel described
in section 5 of this paper actually yields a
simple linear model which can also be used to
generate binary textures. The model which results
from the determination of the generation kernel
may be expressed in equation form as

    $V_{N+1,k} = \beta_1 V_{1,k} + \beta_2 V_{2,k} + \cdots + \beta_N V_{N,k} + \beta_0 + \epsilon_k$    (11)

or more simply as

    $V_{N+1} = \beta_1 V_1 + \beta_2 V_2 + \cdots + \beta_N V_N + \beta_0 + \epsilon$    (12)

Once the estimates of the $\beta$'s are known, a pixel
$V_{N+1}$ may be calculated from a set of given values
$V_i$ plus an error $\epsilon$. In one-dimensional analysis
this is sometimes known as the autoregressive time
series model [6]. For binary $V_i$, a value of $V_{N+1}$
will be produced which is non-binary. To generate
binary data using this model will therefore
require quantization.

In the N-gram approach to texture simulation,
the randomness of the texture is induced by the
generation of a uniformly-distributed
pseudo-random variable during the generation
process. The comparison of this value with the
estimate of the generation parameter,
$P(V_{N+1}/V_1, \ldots, V_N)$, yields the next binary pixel.
A similar type of randomness must occur in the
generation of binary textures using the linear
model of Eq. (12). This randomness is expressed
in the model in the error term $\epsilon$.

We can obtain an estimate of the distribution
of $\epsilon$ in the same manner as we estimate the $\beta$'s of
the model. This may be done by applying the model
to the sample data from which it was derived and
observing the errors. That is, the linear model
kernel is passed over the parent texture image and
at each point a $\hat{V}_{N+1}$ is calculated based on
Eq. (12) without the error term. Then $V_{N+1} - \hat{V}_{N+1}$
is calculated, where $V_{N+1}$ is the actual value of
the pixel in the parent texture. The histogram of the
values can be used to estimate the distribution of
$\epsilon$. As one step further we could assume that $\epsilon$ has
some known distribution such as Gaussian or
normal, and merely estimate the parameters
necessary to define this distribution. In the
normal distribution case, only the standard
deviation (or variance) of $\epsilon$ needs to be
estimated. The mean of $\epsilon$ is zero in the
least-squares linear model.

Our generation process then consists of the
calculation of $\sum_i \beta_i V_i + \beta_0$ to which we add a random,
normally-distributed error term $\epsilon$, and this value
is then quantized to 0 or 1 based on comparison
with 0.5. Results using this generation method
are shown in Figs. 15(c), 16(c) and 17(c). In
these figures, N was allowed to be as large as 70
as only N coefficients (not $2^N$) need to be stored
along with $\sigma_\epsilon$, the estimate of the error standard
deviation.
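A single generation step of this binary linear model can be sketched as follows. The function name, the coefficient values, and the convention that a value exactly equal to 0.5 is quantized to 1 are assumptions made for illustration.

```python
import random

def next_binary_pixel(context, beta, beta0, sigma, rng):
    """Generate one binary pixel from the linear model.

    The prediction sum_i beta_i * V_i + beta_0 receives a normally
    distributed error with standard deviation sigma, and the result is
    quantized by comparison with 0.5.
    """
    predicted = beta0 + sum(b * v for b, v in zip(beta, context))
    noisy = predicted + rng.gauss(0.0, sigma)
    return 1 if noisy >= 0.5 else 0

# Hypothetical two-element kernel; sigma = 0 removes the randomness so the
# effect of the quantization step alone can be seen.
rng = random.Random(0)
white = next_binary_pixel((1, 0), [0.6, 0.3], 0.05, 0.0, rng)
black = next_binary_pixel((0, 1), [0.6, 0.3], 0.05, 0.0, rng)
```

Only the N coefficients, the constant term and the error standard deviation need to be kept, in contrast to the table of conditional probabilities required by the N-gram method.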


The linear model simulations are slightly
inferior to the N-gram simulations but the
degradation is far less than we would expect from
such a massive compression of information (which
is approximately $2^N$ to 70). The results were
good enough to encourage the application of the
linear model to continuous-tone textures.

McCormick and Jayaramamurthy [7] were perhaps
the first to make a notable attempt to simulate
natural textures using this approach. Their work
consisted of a discussion of the Box and Jenkins
autoregressive (AR), moving average (MA) and
autoregressive integrated moving average (ARIMA)
models including estimation of model parameters
and adequacy of model fit. A very simple model
was then used to simulate two very similar
textures by filling in the holes of a parent
texture using the derived model. Only two
textures, both exhibiting a wood-grain-like
structure, were used. Similar work was done later
by Tou, Kao and Chang [8]. Unfortunately, the
results of their simulation of these textures were
displayed using a printout of Chinese characters
and so the degree of success of their method is
unclear. The appearance of texture synthesis
results on a computer printout will confuse most
observers unaccustomed to such crude image
displays. The models were again very simple and
contained no more than three terms in the linear
model summation. Deguchi and Morishita [9]
attempted to use the linear model to segment and
partition textures. Their approach was only
partially successful.

In the above simulation attempts, the models
used were simple. The process of collecting
statistics and estimating parameters is complex.
In some cases, previous authors attempted to use
the complex Box and Jenkins ARIMA model which
leads to difficult model parameter estimation if
the number of model elements is greater than two
or three.

In our study, the simpler autoregressive
model is used and is allowed to contain a large
number of parameters. This is possible using the
assumption of homogeneity (stationarity) combined
with the forward selection process of choosing
non-contiguous generation kernels as described in
section 5. These models are extended further by
allowing second-order autoregressive models and
non-stationary noise. Results of texture
simulations using these models are included in
this paper.

9. The Linear Autoregressive Model

In section 8 the linear autoregressive model,
used to determine the elements of the generating
kernel, was expressed as

    $Y_k = X_k \beta + \epsilon_k$,    k = 1, ..., M

where $Y_k = V_{N+1,k}$ and $X_k$ is the row vector
$(V_{1,k}, \ldots, V_{N,k}, 1)$.
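Here $\beta$ is an $(N+1) \times 1$ vector of unobservable
parameters and $\epsilon_k$ is an unobservable random
variable such that $E[\epsilon_k] = 0$. The sample number
(index) is denoted by k and M is the total number
of observations. We can also define the vectors $Y$
and $\epsilon$ and the matrix $X$ by

    $Y = (Y_1, Y_2, \ldots, Y_M)^T$,    $\epsilon = (\epsilon_1, \epsilon_2, \ldots, \epsilon_M)^T$,

with $X$ the $M \times (N+1)$ matrix whose kth row is
$(V_{1,k}, \ldots, V_{N,k}, 1)$,
and our model may be expressed as

    $Y = X\beta + \epsilon$    (15)

In equation form, dropping the k subscript, the
model becomes

    $V_{N+1} = \beta_1 V_1 + \beta_2 V_2 + \cdots + \beta_N V_N + \beta_0 + \epsilon$    (16)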


Sums and sums of squares leading to the
calculation of the correlation or covariance
matrix of the parent texture are obtained by
passing any chosen generation kernel pattern over
the texture. From this matrix, the least-squares
parameter estimate of $\beta$ is obtained. The
multiple-pass forward selection process described
in section 5 leads to a final linear
autoregressive model which is then used to
generate textures.

The linear (autoregressive) model of Eq. (16)
was used to simulate a variety of natural
textures. Stationary, independent Gaussian noise
was used to drive the synthesis process. The
variance of the noise was estimated by applying
the model to the sample data and observing the
prediction errors. These errors, which are often
called residuals, are pixels formed by the
difference $V_{N+1} - \hat{V}_{N+1}$, where $V_{N+1}$ is the actual
observed pixel value of the sample parent image
and $\hat{V}_{N+1}$ is the corresponding fitted value
obtained by use of the linear model. The standard
deviation of these errors can be measured and used
as the standard deviation of pseudo-random
normally-distributed noise in the generation
process. Actually, this information can also be
obtained during the decomposition of the
covariance matrix.

To tie this method with the previous
sections, this method of texture synthesis is
equivalent to defining $P(V_{N+1}/V_1, \ldots, V_N)$ as a
normal distribution with mean

    $\sum_{i=1}^{N} \beta_i V_i + \beta_0$    (17)

and variance $\mathrm{VAR}[\epsilon]$.
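The two ingredients of this continuous-tone synthesis, the residual standard deviation and a draw from the normal distribution of Eq. (17), can be sketched as below. The function names and numeric values are hypothetical, and the sketch does not clip the generated value to the displayable gray-level range.

```python
import math
import random

def residual_std(actual, fitted):
    """Standard deviation of the residuals V - V_hat over the parent texture."""
    residuals = [a - f for a, f in zip(actual, fitted)]
    mean = sum(residuals) / len(residuals)
    return math.sqrt(sum((e - mean) ** 2 for e in residuals) / len(residuals))

def next_pixel(context, beta, beta0, sigma, rng):
    """Draw the next pixel from a normal distribution whose mean is the
    linear prediction of Eq. (17) and whose standard deviation is sigma."""
    mean = beta0 + sum(b * v for b, v in zip(beta, context))
    return rng.gauss(mean, sigma)

# Small hand-made example with residuals -1, 1, -1, 1.
sigma_hat = residual_std([10, 12, 14, 16], [11, 11, 15, 15])

# With sigma = 0 the generated pixel equals the predicted mean.
pixel = next_pixel((100, 120), [0.5, 0.25], 10.0, 0.0, random.Random(0))
```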


The number of pixels in each generation
kernel, N, varied from 30 to 60. The simulation
results are shown in Figs. 18(b), 19(b) and 20(b).

These  simulations  indicate  that  the  linear 
model  using  stationary  gaussian  noise  produces 
acceptable  simulations  of  a  variety  of  textures 
including  wool  and  water. 


A full second-order model with N independent
variables will employ $(N^2+3N)/2$ terms in addition
to the $\beta_0$ (constant) and $\epsilon$ (error) terms. This
general second-order linear model may be written
as

    $V_{N+1} = \sum_{i=1}^{N} \beta_i V_i + \sum_{i=1}^{N} \sum_{j=i}^{N} \beta_{ij} V_i V_j + \beta_0 + \epsilon$    (20)
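The term count quoted above can be checked with a short sketch that builds the regressors of Eq. (20) for one kernel context. It assumes that the double sum runs over i <= j so that each product appears once; the function names are hypothetical.

```python
from itertools import combinations_with_replacement

def second_order_terms(values):
    """Linear terms followed by all products V_i * V_j with i <= j."""
    products = [a * b for a, b in combinations_with_replacement(values, 2)]
    return list(values) + products

def term_count(n):
    """Number of terms besides the constant and error terms: (N^2 + 3N) / 2."""
    return (n * n + 3 * n) // 2

terms = second_order_terms([2, 3])  # V_1 = 2, V_2 = 3
```

For two variables the regressors are V_1, V_2, V_1^2, V_1 V_2 and V_2^2, which matches the five terms given by the formula.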


Second-order models have been particularly useful
in studies where surfaces must be approximated by
polynomials of low order. In all cases, a
second-order model will "fit" given data as well
as or better than a first-order model, since
first-order models are a subset of second-order
models. This does not imply that the second-order
model will be more correct however, as the process
which we are attempting to model may be in fact a
linear first-order process or some other type.

The use of a second-order model to
approximate the surface of the general stochastic
model could have many advantages over a
first-order model. An example of fitting such a
model in one dimension to a given set of data is
shown in Fig. 8.

Still, the linear first-order model may
provide a good fit to the data and the magnitude
of the unexplained variance in the data may be
large enough that the improvement due to the
addition of second-order terms to the model may be
barely noticeable. In two dimensions, the fitting
problem is one utilizing a quadric surface such as
an elliptic paraboloid or hyperbolic paraboloid
versus a plane to fit a given set of data. Again,
the fit may or may not be markedly better. Adding
second-order terms to a model will always produce
a fit as good as or better than a
first-order model but the number of computations
required to compute the coefficients and fit the
model is much greater.


10.  Second-Order  Linear  Model 


When we say that a model is linear or
nonlinear, we are referring to linearity or
nonlinearity in the parameters. The value of the
highest power of an independent variable in the
model is called the order of the model. For
example,

    $Y = \beta_1 V_1 + \beta_{11} V_1^2 + \beta_0 + \epsilon$    (18)

is a second-order linear model. A general
second-order linear model with two independent
variables may be written as

    $Y = \beta_1 V_1 + \beta_2 V_2 + \beta_{11} V_1^2 + \beta_{22} V_2^2 + \beta_{12} V_1 V_2 + \beta_0 + \epsilon$    (19)


It is also important to note that the
covariances of the $V_i$ are required in order to
obtain least-squares estimates of the parameters
in the first-order model [10]. Covariance is
essentially a second-order statistic. Therefore,
estimating the parameters of a second-order model
will require the use of fourth-order statistics.
Specifically, the correlation of terms $V_{i_1} V_{i_2}$ and
$V_{i_3} V_{i_4}$ is needed. This may cause serious problems
as in many cases the variables in a second-order
model will be highly intercorrelated. For
example, the terms $V_1^2$, $V_2^2$ and $V_1 V_2$ (if $V_1$ is
highly related to $V_2$) may be strongly correlated.
This situation, often referred to as
multi-collinearity, may cause problems during the
inversion or decomposition of the estimated
correlation matrix, a necessary step in model
parameter estimation. For this reason, care
should be exercised during the analysis of
second-order models.

Inside a circular radius of 14 pixels from
$V_{N+1}$ there are 307 pixels. To search all possible
cross products in this region to find the most
significant would require over 47,000 cross
products to be examined. Computation of a
covariance matrix containing all of these terms is
impossible (in practice). In our study we were
limited to investigating only 820 possible cross
products for entry into the generation model. As
most of the variance was explained by the linear
terms of the model, most of the cross products
were insignificant from a statistical point of
view. This selection procedure is detailed in
[10] and in section 5. Those that were
significant were entered into the model and a new
texture was generated using Eq. (20) with
stationary Gaussian noise having zero mean and
fixed variance.
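The order of magnitude of the search can be verified with a short calculation. It assumes that squared terms are counted together with the products of distinct pixels; that interpretation is an assumption made here, chosen because it is consistent with the figure of over 47,000.

```python
import math

pixels_in_region = 307

# Products of two distinct pixels plus the squared term of each pixel.
cross_products = math.comb(pixels_in_region, 2) + pixels_in_region
```

Under this assumption `cross_products` is 47,278.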

The results of texture simulations using the
second-order linear model are shown in
Figs. 18(c), 19(c) and 20(c). On some of these
textures only a slight improvement from the
addition of second-order terms may be seen. In
most cases, no change can be observed even when
the results are displayed on a high-resolution
display device. The lack of improvement could be
due to the small number of cross-terms examined;
however, we feel that this number is sufficiently
large to show any considerable improvement due to
the addition of second-order terms to the linear
model.

11.  Textures  with  Non-Stationary  Noise 

Applying a texture generation model to the
original parent texture image data used to
estimate its parameters gives a residual error
image. This sequence plot of residuals in the
case of two-dimensional texture synthesis is
essentially an image containing the pixel
differences or residuals $V_{N+1} - \hat{V}_{N+1}$. Here $\hat{V}_{N+1}$ is
the prediction of the next pixel in the sequence
as a linear function of the pixels around it
according to the model without any noise added.
Naturally, we would expect these errors to be
small as merely subtracting one pixel from its
nearest neighbor would yield a small value in most
natural, low-noise images. Definite patterns are
seen to exist in these images and thus a violation
of the independence assumption is indicated.
Ideally, this residual image would be uncorrelated
noise.

The distribution of this error and the
relationships between the predicted and actual
pixel values were utilized to generate textures
using non-stationary noise. The procedure begins
by generating a pixel $\hat{V}_{N+1}$ according to Eq. (20)
excluding the error term. With this predicted
value a random error value is chosen to be added
to $\hat{V}_{N+1}$. This error value is chosen from the
distribution of error as a function of $\hat{V}_{N+1}$ and
can have any arbitrary distribution. The next
pixel will then be computed in a similar manner.

Results of texture synthesis formed using this
model are shown in Figs. 18(d), 19(d) and 20(d).

The arbitrary distribution of error as a
function of $\hat{V}_{N+1}$ is calculated by applying the
calculated linear model to the original parent
texture and computing a histogram of errors as a
function of $\hat{V}_{N+1}$. In other words, the
distribution of $\epsilon$ depends on the value of $\hat{V}_{N+1}$,
and this distribution is estimated by applying the
model to the original parent texture and observing
$\hat{V}_{N+1}$ and the error $V_{N+1} - \hat{V}_{N+1}$.

In most cases, considerable improvement is
seen when these simulations are critically
observed on a high-resolution display device and
compared with the stationary model results. Of
course, the information required to generate them
is considerably greater also. The distribution of
errors as a function of $\hat{V}_{N+1}$ must be condensed and
coded to some degree to minimize storage
requirements. For a 256-grey-level image $\hat{V}_{N+1}$
usually ranges from -50 to 305 and the errors,
$V_{N+1} - \hat{V}_{N+1}$, from -255 to +255. These ranges were
determined experimentally. This would yield quite
a large amount of data if fully stored. By
storing a small number (under 100) of typical errors
for each range (and not each single value) of $\hat{V}_{N+1}$,
the number of data values we are required to store
can possibly be reduced to under 1000. Therefore
it is believed that this approach of using
non-stationary, non-Gaussian noise to generate
textures may be quite acceptable even with severe
storage limitations.
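One way the prediction-dependent error distribution could be stored and sampled is sketched below. Grouping the predicted values into fixed-width ranges, keeping the raw residuals of each range, and adding no error when a range has no stored residuals are all assumptions made for this example; the names are hypothetical.

```python
import random
from collections import defaultdict

def build_error_table(predicted, actual, bin_width):
    """Group residuals (actual - predicted) by the range of the predicted value."""
    table = defaultdict(list)
    for p, a in zip(predicted, actual):
        table[int(p // bin_width)].append(a - p)
    return table

def noisy_pixel(prediction, table, bin_width, rng):
    """Add an error drawn from the empirical distribution stored for the
    range that contains this prediction (no error if the range is empty)."""
    errors = table.get(int(prediction // bin_width))
    return prediction + (rng.choice(errors) if errors else 0.0)

# Hand-made example: small errors for dark predictions, large for bright ones.
table = build_error_table([10, 12, 200, 210], [11, 10, 230, 180], 32)
rng = random.Random(0)
bright = noisy_pixel(205, table, 32, rng)
unseen = noisy_pixel(100, table, 32, rng)
```

Limiting the number of residuals retained per range would give the kind of storage reduction discussed above.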

12. Skip-Generate Method

Simulating textures which have a fine
structure  is  usually  a  much  easier  process  than 
simulating  textures  with  coarse  structure.  This 
occurs  because  the  linear  model  contains  fewer 
terms  if  the  texture  pixels  become  uncorrelated 
over  a  small  distance.  For  the  same  texture  at  a 
greater  magnification,  the  pixels  become  highly 
correlated  and  the  linear  model  will  be  forced  to 
contain  more  terms.  As  the  texture  becomes  more 
coarse,  more  time-consuming  statistical 
measurements  must  be  taken  on  the  parent  texture 
over  larger  windows.  Motivated  by  these  problems, 
the  texture  generation  algorithms  in  this  section 
have  been  developed. 

In the texture work so far, pixel $V_{N+1}$ was generated based on
pixels above or to the left of it (see Fig. 5(b)). As discussed in
section 5, the kernel does not have to be contiguous. This kernel shape
was chosen to ensure that the image space of our synthesized texture
was filled during the generation process. However, generating pixels
along a row, row by row, is not the only way of filling an image space.

Consider the non-contiguous kernel mask in
Fig.  9.  If  the  spacing  between  the  pixels  in  this 
mask  is  8,  using  the  linear  model  in  Eq.  (16)  to 
generate  the  right-most  pixel  in  the  bottom  row, 
we  can  generate  every  8th  pixel  along  every  8th 
row.  At  each  step  the  next  pixel  is  generated 
based on the previously-generated pixels around it
(ignoring boundary conditions). After generating
an image with this type of spacing, the pixels
midway between the previously-generated pixels on
each row may be generated using the mask in
Fig. 10. In this mask, the pixel with the "x" in
it denotes the next pixel, $V_{N+1}$, to be generated
according to Eq. (17). Naturally, the linear
model used in this step will have different
coefficients than the previous one. It is also
interesting to note that new pixels depend not
only  on  previously  generated  pixels  above  them  (as 
with  the  mask  in  Fig.  5(b))  but  depend  also  on  the 
pixels  below  them.  Still,  ignoring  boundary 
conditions,  each  pixel  depends  only  on  previously 
generated  pixels.  At  the  next  step  a  mask  similar 
to  that  in  Fig.  11  can  be  used  to  fill  in  the 
pixels  midway  between  the  previously-generated 
pixels  in  each  column.  Again  pixels  are  allowed 
to  depend  on  pixels  around  them. 

By repeatedly using the masks in Fig. 10 and
Fig.  11  with  successively  closer  and  closer  pixel 
spacing,  the  texture  simulation  image  space  is 
filled.  An  example  showing  the  pixels  generated 
at  each  successive  pass  is  shown  in  Fig.  12.  More 
importantly,  to  determine  the  linear  model  for 
each  mask,  only  one  covariance  matrix  is  required 
and  can  contain  as  many  or  as  few  terms  as 
desired.  The  process  of  collecting  statistics  for 
one matrix is not beyond the complexity that we
would  want  to  undertake  for  the  small  number  of 
times  required  by  this  process.  Naturally,  any 
other  stochastic  process  may  be  substituted  for 
the  linear  model.  As  before,  only  the 
measurements required to estimate the parameters
corresponding  to  each  mask  need  to  be  taken.  This 
number  depends  on  the  spacing  of  the  pixels  in  the 
first mask, which should be a power of two. Other
odd-shaped kernels and kernels whose spacing is
not  a  power  of  two  could  be  designed  to  form 
space-filling  sets.  Most  would  require  more 
models  to  be  estimated  and  would  provide  little 
additional  information. 
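
The order in which the skip-generate passes fill the image space can be sketched as follows. This is a minimal illustration of the fill schedule only, not of the linear models themselves; the function name and the convention that the first pass starts at pixel (0, 0) are assumptions made for the example.

```python
def skip_generate_passes(size, spacing=8):
    """Return, for each pass, the list of (row, col) positions it fills."""
    # First pass: every spacing-th pixel along every spacing-th row.
    passes = [[(r, c) for r in range(0, size, spacing)
                      for c in range(0, size, spacing)]]
    s = spacing
    while s > 1:
        h = s // 2
        # Row pass: pixels midway between existing pixels on each filled row.
        passes.append([(r, c) for r in range(0, size, s)
                              for c in range(h, size, s)])
        # Column pass: pixels midway between existing pixels in each column.
        passes.append([(r, c) for r in range(h, size, s)
                              for c in range(0, size, h)])
        s = h
    return passes
```

With an initial spacing of 8 this gives one sparse pass followed by three row passes and three column passes, and every pixel is generated exactly once.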

Texture  simulations  using  this  method  are 
shown  in  Figs.  18(e),  19(e)  and  20(e).  Only  a 
slight  improvement  is  seen  in  some  of  the  texture 
simulations  over  the  synthesis  done  by  the  earlier 
single  linear  model.  Most  of  these  textures  are 
apparently  well  simulated  by  a  carefully  chosen 
model  and  the  results  are  not  critically  dependent 
on  the  coarseness  of  the  textures. 

A  word  of  caution  should  be  added  concerning 
the  computations  involved  in  the  linear  model 
coefficient  calculation  of  this  method.  During 
the  later  stages  of  the  skip-generate  method,  the 
pixels  in  the  generation  kernel  become  highly 
correlated  as  the  distance  between  them  decreases 
with  each  pass.  This  may  cause  the  correlation  or 
covariance  matrix  of  the  model  to  be 
ill-conditioned. To avoid numerical problems, the
number of variables entered into the process, and
therefore the number of steps involved in the
matrix decomposition process, should be kept to a
minimum in some cases. The use of ridge
regression techniques [11] might also be
considered.
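
The effect of ridge regression on a nearly singular covariance matrix can be illustrated with a small sketch. The ridge parameter `lam`, the function names and the use of a plain Gaussian-elimination solver are assumptions made for this example; they are not taken from the work described here.

```python
def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[piv] = M[piv], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (M[i][n] - s) / M[i][i]
    return x

def ridge_coefficients(R, r, lam):
    """Coefficients a solving (R + lam*I) a = r; lam = 0 is the ordinary solution."""
    n = len(r)
    R_reg = [[R[i][j] + (lam if i == j else 0.0) for j in range(n)]
             for i in range(n)]
    return solve(R_reg, r)
```

For two kernel pixels with correlation 0.9999, the ordinary solution has coefficients of magnitude near 50 with opposite signs, while a small `lam` such as 0.01 keeps both coefficients below 1 in magnitude.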


13.  Best-Fit  Texture  Model 


A  method  of  generating  texture  simulations 
according  to  their  Nth  order  densities  was 
investigated  for  binary  textures  in  section  4. 
The simulations resulting from this Markov process
resembled their parent textures quite closely in
most cases. When applying a similar concept to
multi-grey-level imagery, the limits of computer
storage  are  soon  reached.  To  circumvent  this 
constraint,  a  new  method  of  texture  synthesis  was 
developed  and  applied  to  a  number  of  textures. 


In binary texture generation based on N-grams a single functional value
$P(V_{N+1} \mid V_1, \ldots, V_N)$ was stored for each possible pattern
$(V_1, V_2, \ldots, V_N)$ where the $V_i$'s can be zero or one. This
value, also called a generation parameter, represented the conditional
probability that the next pixel, $V_{N+1}$, in the generation process
would be a zero-valued, black pixel. The $V_i$'s were chosen by a best
linear model fit detailed in section 5 and therefore the kernel of
previous pixels $(V_1, \ldots, V_N)$ is not necessarily contiguous (see
Fig. 6). Details concerning the estimation of
$P(V_{N+1} \mid V_1, \ldots, V_N)$ from a parent texture are given in
section 4. For binary textures, this single value is sufficient to
define the distribution of $V_{N+1}$ given $V_1, \ldots, V_N$. The
number of different functions which must be stored is $2^N$. In the
generation process each pixel $V_{N+1}$ is generated based on the
values of the pixels $V_1, \ldots, V_N$ surrounding it and on a
computer-generated uniformly-distributed random variable. The texture
simulations are generated pixel by pixel along a row until each row is
complete. Pixel generation along the edges of an image can be handled
in a variety of ways but in section 4 pixels in these border regions
were assumed to be any random value, 0 or 1, if they were outside the
image boundaries.
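
The binary generation process just described can be sketched as follows. The kernel offsets and the table of generation parameters used in any call are hypothetical inputs; in the actual work the kernel comes from the linear model fit of section 5 and the parameters from the parent texture, neither of which is reproduced here.

```python
import random

def generate_binary_texture(rows, cols, offsets, p_zero, seed=0):
    """Generate a binary texture row by row from a table of generation parameters.

    offsets: kernel as (drow, dcol) pairs pointing to previously generated pixels.
    p_zero:  dict mapping each pattern tuple (V1, ..., VN) to P(next pixel = 0).
    """
    rng = random.Random(seed)
    img = [[0] * cols for _ in range(rows)]
    for r in range(rows):
        for c in range(cols):
            pattern = []
            for dr, dc in offsets:
                rr, cc = r + dr, c + dc
                if 0 <= rr < rows and 0 <= cc < cols:
                    pattern.append(img[rr][cc])
                else:
                    # Outside the image: assume a random value, 0 or 1.
                    pattern.append(rng.randint(0, 1))
            u = rng.random()  # uniformly distributed random variable
            img[r][c] = 0 if u < p_zero[tuple(pattern)] else 1
    return img
```

For a kernel of N pixels the table `p_zero` holds one entry per binary pattern, which is where the storage requirement comes from.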


A similar approach could be used to generate multi-grey-level textures.
For a texture containing g grey levels, $g^{N+1}$ different functions,
$P(V_{N+1} \mid V_1, \ldots, V_N)$, must be stored. (Actually only
$(g-1) \cdot g^N$ are required as
$\sum_X P(X \mid V_1, \ldots, V_N) = 1$ for all $V_i$.) Storage
limitations are soon reached. Also, estimation of
$P(V_{N+1} \mid V_1, \ldots, V_N)$ is difficult as multiple occurrences
of the pixel pattern $V_1, \ldots, V_N$ may not exist in the parent
texture. Therefore even without storage limitations the problem of
estimating $P(V_{N+1} \mid V_1, \ldots, V_N)$ from a given parent
texture, which represents the true distribution of $V_{N+1}$ given the
values of $V_1, \ldots, V_N$, is complex.

This estimation problem no doubt has a number of ad hoc solutions. The
problem is basically that for large N and/or large g, there may not be
a suitable number of occurrences of the pattern $V_1, \ldots, V_N$ to
adequately estimate the distribution $P(V_{N+1} \mid V_1, \ldots, V_N)$
given a finite sample size. Even though a certain pattern never occurs
or rarely occurs in our sample parent texture it is not implied that
such a pattern is impossible and will never occur in our simulation
synthesis. We might often find numerous occurrences of this pattern if
our sample size or the size of our parent texture was increased,
especially in noisy and fine-structured textures. But as this very
large sample may not be present, we must estimate
$P(V_{N+1} \mid V_1, \ldots, V_N)$ for all $V_1, \ldots, V_N$ based on
available samples.

One approach would be to use sample patterns which closely resemble but
which may not be exactly the same as each pattern $(V_1, \ldots, V_N)$.
That is, in a pictorial sense, we use patterns of $(V_1, \ldots, V_N)$
which look "close to" the pattern for which we are attempting to
estimate $P(V_{N+1} \mid V_1, \ldots, V_N)$. Therefore samples in our
sample parent texture may be used to estimate numerous
$P(V_{N+1} \mid V_1, \ldots, V_N)$ and not just those they fit exactly.
The concept of a distance function must be used to numerically define
"close to". Given two patterns, one from our sample texture and the
other from the conditional probability of the kernel we are attempting
to estimate, the distance measure can be used to determine the value of
that sample in estimating $P(V_{N+1} \mid V_1, \ldots, V_N)$. If the
fit between the kernel pattern and the pattern in the sample texture is
good, the associated value of $V_{N+1}$ in the parent texture will be
valuable in estimating $P(V_{N+1} \mid V_1, \ldots, V_N)$.

Normally, when N and g are small or when we have many samples for any
given $V_1, \ldots, V_N$, we can use the histogram of the associated
$V_{N+1}$ to estimate $P(V_{N+1} \mid V_1, \ldots, V_N)$. Here the
relative number of times a particular value of $V_{N+1}$ occurs given a
pattern indicates the conditional probability we are attempting to
estimate. This was discussed in section 4. Where a distance measure is
used instead, a good fit could be considered to be synonymous with high
frequency of occurrence of that pattern and a poor fit with low
frequency of occurrence.

If such a method of estimating these conditional probabilities is used
we are still faced with a huge storage problem. For this method to be
practical, the storage requirement must be reduced. From an information
standpoint, it is interesting to note that a method of estimating
N-grams or conditional probabilities $P(V_{N+1} \mid V_1, \ldots, V_N)$
from a sample parent texture image produces $g^{N+1}$ data values from
$M^2$ pixel samples, where M is the side length of the square parent
texture image in pixels. For large g and N this is a drastic increase
in data. But the actual information content can really never be greater
than that content of the sample parent texture image. Therefore, this
$M^2$ value represents an upper bound on the amount of data we should
use to generate a texture simulation. Any amount of data exceeding this
will contain redundant, useless data.
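
The gap between the size of a full N-gram table and the number of samples available can be made concrete with a little arithmetic. The kernel size N = 4 below is an arbitrary value assumed for illustration; the 128x128 parent size is the one used for the simulations reported later in this paper.

```python
def ngram_table_size(g, n):
    """Number of values P(V_{N+1} | V_1..V_N) for g grey levels and N kernel pixels."""
    return g ** (n + 1)

def required_table_size(g, n):
    """Values actually needed, since each conditional distribution sums to 1."""
    return (g - 1) * g ** n

samples = 128 * 128                      # pixel samples in a 128x128 parent texture
full_table = ngram_table_size(256, 4)    # 256 grey levels, assumed N = 4
```

Here `samples` is 16384 while `full_table` is 256 to the fifth power, about 1.1 x 10^12, so nearly all table entries would have to be estimated from no direct observations at all.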


Combining this concept of upper bound with the idea of forming a
distance measure to compare two texture kernel patterns leads to a new
texture synthesis method. In this method, we generate the next pixel
based on the pixels in the kernel surrounding it (see Fig. 5(b)) and
their comparison to similar kernels in the parent texture. This
comparison is made by passing the kernel currently present in the
simulation process over the parent texture and computing the distance
function at all possible points (see Fig. 13). Denoting the pixels in
the parent texture by $X_{i,j}$, $i,j = 0, \ldots, M-1$, and the pixels
in the kernel $V_1, \ldots, V_N$ by $Y_{i,j}$, we can compute a
comparison image

$$WMSD_{a,b} = \sum_i \sum_j (X_{i+a,j+b} - Y_{i,j})^2 \cdot W_{i,j} \qquad (21)$$

where

$$W_{i,j} = \frac{1}{(i - i_{NEXT})^2 + (j - j_{NEXT})^2} = \frac{1}{r^2} \qquad (22)$$

where r is the Euclidean distance between pixel $Y_{i,j}$ and the kernel
eye $Y_{NEXT}$, and the coordinates of the eye are given by
$(i_{NEXT}, j_{NEXT})$.

As the first step in comparing a given kernel to all kernels in the
parent texture, for each point (a,b) in the parent texture, ignoring
edges, the WMSD is computed, resulting in an image of WMSD's. Where the
fit between the generated kernel $Y_{i,j}$ and the image $X_{i,j}$ is
good, we would expect $WMSD_{a,b}$ to be small. The smallest WMSD
represents the "best" fit according to our norm. We could choose the
$Y_{NEXT}$ associated with this best fit at point (a,b) to be our next
pixel in the generation process; however, this can cause problems.
First of all, the generation process would "lock in" on the parent
texture and the generated texture could very well become just an exact
copy of the input parent texture. Second, we know ideally that
$Y_{NEXT}$ has a distribution, not just a mean. In the autoregressive
model of section 9 we gave the predicted pixel a distribution by adding
random noise to it. Although this could be done here, such an approach
would fail to use additional information contained in the WMSD image.
There may be a set of points (a,b), all exhibiting a good fit to the
kernel pattern $Y_{i,j}$. In fact, the best fit may have a noisy
$Y_{NEXT}$ and the other good fits could provide information to improve
the prediction of the $Y_{NEXT}$ in the generation process. Using a set
of best fits is equivalent to increasing our sample size. We look at a
set of similar patterns to pick our $Y_{NEXT}$.

At this point there are numerous ways to proceed. Logically those
patterns with the "best" fit should provide better estimators for
$Y_{NEXT}$, so some kind of weighting decision is needed to choose the
relative importance of the WMSD's found. If we search through the WMSD
image and find the minimum value, $WMSD_{min}$, and scale all the
WMSD's by that, we form a new image MAXI

$$MAXI_{a,b} = \frac{WMSD_{min}}{WMSD_{a,b}} \qquad (23)$$


This  image  has  the  value  1.0  at  the  best  fit  point 
and  values  0  <  MAXI  <  1.0  elsewhere. 
Here we can look at the $MAXI_{a,b}$ image and study its range. If
$0.16 \le MAXI \le 1.0$ it is implied that the worst fit yields a 0.16
value. Somehow that worst fit should be translated to imply that the
probability of choosing the $Y_{NEXT}$ associated with that point
$(a,b)_{worst}$ is nearly 0.0. The simplest way of doing that is to
take powers of the image $MAXI_{a,b}$. The maximum remains 1.0 while
smaller numbers approach 0.0. For example, $(1.0)^{16} = 1.0$ but
$(0.16)^{16}$ is on the order of $10^{-13}$. We do this to obtain an ad
hoc estimate of $P(Y_{NEXT} \mid Y_{i,j})$. After experimentally
studying the values of $MAXI_{a,b}$ and its powers, the value of 16 was
chosen and a new image PDFUNS

$$PDFUNS_{a,b} = (MAXI_{a,b})^{16} \qquad (24)$$

was used to estimate the probability density function
$P(Y_{NEXT} \mid Y_{i,j})$. PDFUNS is then scaled so that
$\sum_a \sum_b PDFUNS_{a,b} = 1$. Finally a uniformly distributed
random variable, r, on [0,1] is generated and a point (c,d) is found
such that

$$\sum_{a=1}^{c-1} \sum_b PDFUNS_{a,b} + \sum_{b=1}^{d-1} PDFUNS_{c,b} < r \le \sum_{a=1}^{c-1} \sum_b PDFUNS_{a,b} + \sum_{b=1}^{d} PDFUNS_{c,b} \qquad (25)$$

The $Y_{NEXT}$ associated with the kernel shape at (c,d) is then used
as the next pixel in the generated image. The process is continued
until a full texture image is generated with the kernel window moving
one pixel at each step, row by row.
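
One generation step of the best-fit method might be sketched as follows. This is a simplified illustration, not the implementation used for the reported results: the kernel is passed as a list of offsets from the kernel eye, the weight is taken as the inverse squared distance to the eye, and the small constant `eps` is an added assumption that guards against division by zero when a perfect match exists.

```python
import random

def best_fit_next_pixel(parent, kernel, rng, power=16, eps=1e-9):
    """Pick the next pixel value by sampling among well-fitting parent locations.

    parent: 2-D list of grey levels (the parent texture X).
    kernel: list of ((di, dj), value) pairs, offsets relative to the kernel eye.
    """
    rows, cols = len(parent), len(parent[0])
    candidates, wmsd = [], []
    for a in range(rows):
        for b in range(cols):
            total, inside = 0.0, True
            for (di, dj), y in kernel:
                i, j = a + di, b + dj
                if not (0 <= i < rows and 0 <= j < cols):
                    inside = False  # ignore edges
                    break
                # Weighted squared difference, as in Eqs. (21) and (22).
                total += (parent[i][j] - y) ** 2 / (di * di + dj * dj)
            if inside:
                candidates.append(parent[a][b])
                wmsd.append(total + eps)
    w_min = min(wmsd)
    pdf = [(w_min / w) ** power for w in wmsd]   # Eqs. (23) and (24)
    u = rng.random() * sum(pdf)                  # uniform deviate, scaled to total mass
    running = 0.0
    for value, p in zip(candidates, pdf):        # cumulative search, as in Eq. (25)
        running += p
        if u < running:
            return value
    return candidates[-1]
```

Scaling the uniform deviate by the total of `pdf` has the same effect as normalizing PDFUNS to sum to one before the cumulative search.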


In an indirect way, this is equivalent to generating a random variable
having any distribution using the desired cumulative distribution
combined with a uniformly distributed random variable (which is easy to
generate). In other words, uniformly-distributed deviates are
transformed to deviates having the desired distribution using the
inverse cumulative distribution function. This is frequently done in
simulations.

The results from this best-fit texture synthesis method are very
pleasing but the number of computations required is large. Other
similar algorithms could be developed which are simpler and could
possibly produce even better results. With the decrease in computation
costs and the increase in processor speeds of future computers, such
texture synthesis methods could be easily implemented in the future.

For a kernel containing 55 pixels (see Fig. 14) passed over a 128x128
parent texture, approximately $7.2 \times 10^6$ operations (additions
or subtractions) are needed to get the WMSD image defined by Eq. (21).
Another $2.6 \times 10^5$ are required to find the next pixel according
to Eq. (25). Therefore, to generate a 512x512 texture requires
$1.96 \times 10^{12}$ (2 trillion) operations.

Results from texture synthesis done by this method are shown in
Figures 18(f), 19(f) and 20(f). Each of these images is 512x512 pixels.
A 128x128 section of each original (parent) texture was used for the
simulation. Bark exhibits very large macro structure and this is lost
in the simulation. Still this simulation is better in many ways than
those obtained using other models. The large number of operations makes
this process very time consuming even when a pipelined processor is
dedicated to the task. About 5.5 days of dedicated time on an AP120B
were required to generate each texture.

Although this method is of little practical use due to the
computational complexity of the algorithm, a few points should be made.
With constantly increasing computer processing speeds, a simplified
version of this texture simulation method may be implemented in the
near future. It is even possible that such computations could be
performed by an array of microprocessors. In any case such brute-force
approaches are simple and could be made cost-effective in the future.
The results also indicate visually the amount of texture information
present in a 55 pixel window (see Fig. 14) because at each pixel
generation step, the next pixel in the Markov process depends on only
the pixels in this neighborhood.

Finally, this approach is admittedly ad hoc. Numerous distance measures
could replace the one chosen in this work and each would give different
results that might appear better or worse. It is always important that
the process be random and not merely copy the texture sample. If the
simulation region is much larger than the parent sample, a
deterministic process will quickly generate patterns that can easily be
seen to repeat. In other words, the histogram represented by
$P(V_{N+1} \mid V_1, \ldots, V_N)$ should rarely be a delta function. A
reduction in the number of computations could be made if the kernel was
non-contiguous. Also, better results could probably be obtained if the
kernel window was larger. The shape, contiguity and size of the kernel
in this study were chosen primarily for computer programming
considerations.


14. Conclusions

The  N-gram  model  of  section  4  was  an 
extension  of  earlier  one-dimensional  studies 
applied  to  two-dimensional  natural  textures.  The 
results  were  very  good  even  with  the  severe 
constraint  imposed  by  the  upper  limit  on  the 
number  of  pixels  allowed  in  the  generation  kernel. 
In  section  8,  the  binary  linear  model,  which  was 
used  to  determine  the  contents  (shape)  of  the 
generation  kernel  (see  section  5),  was  used  to 
generate  binary  textures.  The  textures  generated 
using  this  model  were  nearly  equal  in  quality  to 
those  of  the  more  complex  and  storage-consuming 
N-gram  model.  The  N-gram  model  of  section  4  uses 
a  generation  kernel  whose  contents  (shape)  depends 
on  the  linear  model.  Therefore,  the  number  of 
computations  required  in  the  statistics  collection 
portion  of  the  N-gram  model  necessarily  includes 
computations of the linear model. However, in
some  cases,  points  which  lie  far  from  the  kernel 
eye  can  be  neglected  in  the  N-gram  model  as  only 
the  best  few  are  used  due  to  storage  limitations. 
On  the  other  hand,  such  points  should  be  included 
in the linear model; therefore a larger
neighborhood  surrounding  the  kernel  eye  should  be 
used  in  the  estimation  of  the  linear  model. 

In  section  9,  the  linear  autoregressive  model 
was  applied  to  256-gray-level  imagery.  The 
results  were  good  considering  the  vast  reduction 
of  information  caused  by  the  statistics  collection 
process.  Slightly  better  results  were  obtained  by 
allowing  the  model  to  contain  cross  terms  but  the 
resulting  complexity  suggests  that  the  change  in 
texture  quality  is  not  worth  the  added  effort  and 
computational  expense.  Using  non-gaussian, 
non-stationary  noise  in  the  model  produced 
slightly  better  results  but  with  a  requirement  of 
slightly  increased  storage. 

The  skip-generate  method  of  section  12  may  be 
used  to  improve  the  simulation  of  textures  having 
a  coarse  structure.  The  model  produces  results 
equal  in  quality  to  the  linear  autoregressive 
model  while  requiring  fewer  computations  in  the 
collection  process. 

The  best-fit  model  of  section  13  represents  a 
brute-force approach to texture synthesis. Though
computationally  demanding,  the  final  results  show 
that  excellent  texture  simulations  can  be 
generated  using  complete  statistics  from  a 
relatively  small  neighborhood.  The  problems  with 
a  small  neighborhood  are  seen  in  the  simulation  of 
textures  such  as  bark  where  the  size  of  the 
primitives  in  the  texture  is  much  greater  than  the 
window  used  in  the  best-fit  calculation. 

It  would  be  unwise  to  believe  that  all 
textures  could  be  generated  using  any  single 
approach, especially one which promises to
compress texture information to a handful of
numbers. Yet this is precisely what has been
attempted in the texture synthesis work of this
paper. These methods could be modified and
combined  to  form  even  more  powerful  texture 
synthesis  techniques. 

It  is  important  to  note  the  power  and 
complexity  of  each  synthesis  method  of  this  paper. 
Many  textures  can  be  simulated  well  using  simple 
models  such  as  the  autoregressive  model  if  the 
model  is  carefully  constructed.  Improvements  in 
texture  simulation  were  made  by  modifying  these 
models  and  allowing  them  to  become  more  complex 
and  use  more  information  in  the  generation 
process.  Other  textures  require  more  complex 
models  such  as  the  best-fit  model.  The 
shortcomings  of  each  method  will  constantly 
indicate  where  future  work  can  be  done. 

Fig. 10. Second-pass generation kernel.

Fig. 11. Third-pass generation kernel.

Fig. 12. Filled space of skip-generate kernel.

Fig. 13. Passing kernel over parent texture.

Fig. 14. Best-fit model kernel.

(a) Original
(b) N-gram
(c) Linear model

Fig. 15. Binary wool.

(c) Linear model

Fig. 17. Binary bark.

(b) First-order linear model
(d) Second-order linear model with nonstationary noise
(e) Skip-generate model
(f) Best-fit model

Fig. 18. Wool.

ABSTRACT

Two related methods for hierarchical representation of curve
information are presented. First, edge pyramids are defined and
discussed. An edge pyramid is a sequence of successively lower
resolution images, each image containing a summary of the edge or curve
information in its predecessor. This summary includes the average
magnitude and direction in a neighborhood of the preceding image, as
well as an intercept in that neighborhood and a measure of the error in
the direction estimate. An edge quadtree is a variable-resolution
representation of the linear information in the image. It is
constructed by recursively splitting the image into quadrants based on
magnitude, direction and intercept information. Advantages of the edge
quadtree representation are its ability to represent several linear
features in a single tree, its registration with the original image,
and its ability to perform many common operations efficiently.


INTRODUCTION 

Edges provide much information about the contents of an image. Often
this information is hard to interpret because of the large amounts of
data involved and the existence of spurious edges that arise from noise
in the image. This paper presents two related approaches to
representing edges that attempt to overcome some of the difficulties in
analyzing edges. The first approach is based on the use of a pyramid,
or sequence of images, each a lower resolution version of its
predecessor. The second involves a variable resolution representation,
in which the local consistency of the edges determines the resolution
at which they are represented. This approach builds a quadtree from the
image, with leaves in the tree storing information about the edges that
pass through square subregions of the image. Both of the
representations are able to represent not only edges, but any linear
information.

Many researchers have taken advantage of the pyramid structure to
devise various efficient image-processing algorithms ([8], [14], [30],
[31], [32]). Most of these algorithms, however, have dealt with images
containing extended homogeneous regions, or blobs. The use of pyramids
in linear feature analysis is significantly more complex than in region
analysis. This is because a pyramid is well suited for representing
images whose major features are two dimensional. Such features tend to
retain their integrity and remain recognizable when lower resolution
versions of the image are constructed using a simple rule such as
averaging over small local neighborhoods. In contrast, the important
features of edge or curve images are concentrated in a small proportion
of the image, and it is the positions and orientations of the edges or
curves that are the important information in the image. This paper
provides a method of constructing edge pyramids that allows the
advantages of the pyramid structure to be applied to edge and curve
images. Some of these advantages include the compression of data to
manageable size, and the ability to direct costly analysis in small
regions of the original image, or set parameters such as thresholds.
Projecting down from a given level in the pyramid also gives rise to an
image in which all the features are of a known minimum size, and which
has had much of the noise smoothed out.

A system has been implemented that builds pyramids from edge images,
and can reconstruct edge images from low resolution levels in the
pyramid. This system includes an edge enhancement scheme that is
interesting in its own right. It also shows empirically the ability of
an edge pyramid to retain most of the useful information at low
resolutions, and the ability to reduce the amount of noise in the
image.

The edge quadtree representation uses the same information as the edge
pyramid, but has the ability to change the resolution of the
representation to account for the edge information in the image. Thus,
where edges are long and have consistent directions, large portions of
the edges can be represented by leaves high up in the quadtree. Where
edges are close together, however, or at corners, much smaller leaves
may be needed to represent the edge information. This gives rise to a
polygonal approximation of the curve information. The quadtree is shown
to be useful for representing several edges or curves in a single
structure, in contrast to other representations like the strip trees of
Ballard ([1]), the upright rectangles of Burton ([4]), and the chain
codes of Freeman ([18]). The representation allows many common edge
operations to be performed efficiently, and the fact that it is in
registration with the image, and with ordinary quadtree representations
of region-like information built from the image, simplifies interaction
between region and edge operations.

EDGE  PYRAMIDS 

Building an edge pyramid

A pyramid to be used in edge or curve analysis cannot simply be
constructed by building an averaged pyramid and then applying an edge
detector at each level. This is because smoothing in the pyramid might
cause some edges to be missed, while those edges that are found will be
displaced relative to the original image because the edge detector is
operating on a different image. Were it not the case that direction
information is crucial in edge analysis, a pyramid could be constructed
from an edge magnitude image by using the maximum value of the
magnitude in each neighborhood of the image as the value of a point in
the successor level of the pyramid. Such an approach has been used to
construct a pyramid of line information (Hanson & Riseman, [7]).

For the purposes of this discussion it is assumed that an edge image
has been constructed with information stored at the pixels through
which the edges pass. In order to construct an edge pyramid that
preserves as much information as possible between levels, certain
information must be present at each point of each level, as a summary
of the information of the corresponding neighborhood of its
predecessor. In the implementation to be described below, the
neighborhoods were of size 4 by 4, with overlap between adjacent
neighborhoods. The minimum information to be stored at each point is an
estimate of the magnitude and direction of the edge(s) through the
corresponding neighborhood. In addition, an intercept point is needed
to fix the position of an edge in a neighborhood. A measure that is
also useful is an indication of the error in the direction estimate.
Errors will usually be high at corners or where more than one edge
passes through a neighborhood. The error term can be used to signal
such situations, and cause higher levels of the pyramid to ignore such
regions. This gives rise to a class of edge quadtrees (Section 3).

Thus,  it  is  necessary  to  provide  a  means  of 
estimating  the  magnitude  and  direction  of  the 
edge(s)  passing  through  a  neighborhood,  and  an 
intercept  point  for  each  edge.  To  simplify  matters, 
and  to  prevent  the  amount  of  information  stored  at 
each  point  from  growing  in  an  unbounded  way,  each 
neighborhood  is  restricted  to  having  a  single  edge 
passing through it. Should more than one edge pass
through  a  neighborhood,  the  best  edge  is  sought 
using  the  following  procedure. 

The neighborhoods used in the implementation
were of size 4 by 4, with each neighborhood sharing
two rows with its vertical neighbors (North and
South of it), and two columns with its horizontal
neighbors (West and East). The neighborhoods are
shown in Figure 1. The method is based on the
observation that the central 2 by 2 regions are
disjoint and cover the picture. Thus, by first
finding the best path through the central regions,
and then extending it to the full 4 by 4 neighborhoods,
the complexity of computation is significantly
reduced.
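
The layout of the neighborhoods can be sketched in a few lines of
Python. This is only an illustration under stated assumptions: the
function name, the half-open index ranges, and the clipping at the
image border are not taken from the implementation described here.

```python
def neighborhood_bounds(i, j, height, width):
    """Return (central, full) row/column ranges for parent cell (i, j).

    Ranges are half-open (start, stop) pairs, clipped to the image.
    """
    # Central 2 by 2 region: disjoint across parents and tiling the image.
    central = ((2 * i, 2 * i + 2), (2 * j, 2 * j + 2))
    # Full 4 by 4 neighborhood: the central region padded by one pixel
    # on every side, so adjacent parents share two rows or two columns.
    full = ((max(2 * i - 1, 0), min(2 * i + 3, height)),
            (max(2 * j - 1, 0), min(2 * j + 3, width)))
    return central, full


central, full = neighborhood_bounds(1, 1, 8, 8)
below = neighborhood_bounds(2, 1, 8, 8)[1]
shared_rows = set(range(*full[0])) & set(range(*below[0]))
```

For the parent at (1, 1) in an 8 by 8 image, the central region covers
rows and columns 2 to 3, the full neighborhood covers rows and columns
1 to 4, and the neighborhood to the South shares exactly two rows with it.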

Each point in an edge image contains two pieces
of information, a magnitude and a direction. Within
the central 2 by 2 region of each 4 by 4
neighborhood, the points having the maximum magnitude
are found. Based on these values, a decision is
made as to whether or not an edge exists in the
neighborhood, and, if an edge exists, what kind of
edge it is.

If the maximum value is greater than some
minimum (currently 2 in the implementation) and the
next-to-maximum value is also greater than that
minimum, an edge with two points in the central region
is assumed to exist unless the directions of the
two points are not consistent (i.e., differ by more
than 45 degrees). If one point is significantly
greater than the rest in the central region, the
assumption is made that an edge passes through the
4 by 4 neighborhood, but only touches one corner of
the central 2 by 2 region. If no point differs
significantly from its neighbors (i.e., by more than
2), no edge is assumed to pass through the 4 by 4
neighborhood. It would be represented, however, by
the adjacent neighborhood through whose center it
passed.
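
The decision rule can be sketched as follows. The thresholds come from
the text, but the function names, the reading of "significantly" as
"by more than 2", and the fall-through to the one-point test when the
two strongest points are inconsistent are assumptions made for the sake
of the example.

```python
MIN_MAGNITUDE = 2      # the minimum quoted in the text
SIGNIFICANT = 2        # "differs significantly" read as "by more than 2"
MAX_TURN = 45          # directions are consistent within 45 degrees


def angle_difference(a, b):
    """Smallest absolute difference between two angles, in degrees."""
    d = abs(a - b) % 360
    return min(d, 360 - d)


def classify_central_region(points):
    """points: four (magnitude, direction) pairs from the central 2 by 2.

    Returns "two-point", "one-point", or "none".
    """
    ranked = sorted(points, key=lambda p: p[0], reverse=True)
    (m1, d1), (m2, d2) = ranked[0], ranked[1]
    if m1 > MIN_MAGNITUDE and m2 > MIN_MAGNITUDE:
        if angle_difference(d1, d2) <= MAX_TURN:
            return "two-point"
    # One point standing out from all the others suggests an edge that
    # only touches a corner of the central region.
    if m1 > MIN_MAGNITUDE and m1 - m2 > SIGNIFICANT:
        return "one-point"
    return "none"
```

Under these assumptions, two strong points with similar directions give
a two-point edge, a single dominant point gives a one-point edge, and a
flat or inconsistent region gives no edge.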

It must be understood that two kinds of direction
information are used in evaluating edge continuation
and edge consistency. First, there is the
direction established by the edge detector as an
estimate of the direction of the edge through a
point. Second, there is the direction from a point
in the grid of the image to another point. For
example, a point has eight immediate grid neighbors,
at angles of 0, 45, 90, ..., degrees around it.
These fixed angles, together with the directions
calculated for edges at a point and its neighbors,
are used to establish the continuity and consistency
of neighboring edge points.

An edge with two points passing through the
central 2 by 2 region may consistently be extended
in three ways at each end (Figure 2a). The assumption
made is that edges do not change direction too
radically (i.e., by more than 45 degrees per pixel).
Thus, instead of looking for all possible sequences
of four points through a 4 by 4 neighborhood, two
sets of three continuations are all that need be
examined. The directions of these points are
required to be compatible with the two-point edge
for them to be considered as eligible for continuing
the edge. Should more than one extension be found
at each end, the best is chosen based on the grid
angle between the points, their directions, and
their magnitudes. The best extension at each end
is used to adjust the magnitude and direction of
the corresponding central edge point.

For one-point edges there are five possible
extensions in the 4 by 4 neighborhood (Figure 2b).
The best compatible extension is used to adjust
the edge point here too. (Note that although only
two extensions are compatible with the assumption
that edges bend gradually, the other three points
must be examined to allow for the case where an
edge terminates inside the neighborhood). If no
compatible edge extension can be found, the edge
magnitude is decreased.

The above process is applied to all 4 by 4
neighborhoods in parallel. It may be iterated to
allow information to propagate along the edges.
The result is a preferred edge through each
neighborhood. These chosen edges are used to construct
the edge pyramid.

The process that actually constructs the
pyramid is much simpler than that which establishes
the best edges. For each neighborhood, it must
calculate a magnitude, direction, intercept, and
direction error value. The magnitude is calculated
as the mean of the magnitudes in the 4 by 4
neighborhood. The direction is calculated as the mean
of the directions of those points in the
neighborhood with non-zero magnitude. The error term
is the square root of the sum of squares of
differences between the individual directions
and the mean direction. The intercept is one of
four values, denoting the position of the maximum
magnitude point in the 2 by 2 central region of
the neighborhood. The values are only calculated
for a region if an edge actually passes through
the central 2 by 2 region. These values are
sufficient to reconstruct the edge to within the error
tolerance. Other, more complex pyramid construction
methods could be implemented. For example,
it would be possible to use information high up
in the pyramid to alter decisions made earlier.
Because a decision about the edge through a quadrant
is made based only on local information, it
might be found that a different decision would have
made the edges higher up in the pyramid more
consistent. By backtracking to lower levels and
altering the decisions made there, perhaps a more
informed result would be obtained.
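
A minimal sketch of the four values computed for one neighborhood is
given below. The function name, the row-major numbering 0 to 3 of the
intercept, and the plain arithmetic mean of the directions (which does
not handle wrap-around at 0/360 degrees) are assumptions of the example,
not details confirmed by the text.

```python
import math


def summarize_neighborhood(block):
    """block: 4 by 4 list of (magnitude, direction-in-degrees) pairs.

    Returns (magnitude, direction, error, intercept) for the parent
    point, or None if no edge passes through the central 2 by 2 region.
    """
    central = [block[1][1], block[1][2], block[2][1], block[2][2]]
    if all(m == 0 for m, _ in central):
        return None

    points = [p for row in block for p in row]
    # Mean magnitude over all sixteen points of the neighborhood.
    magnitude = sum(m for m, _ in points) / len(points)

    # Only points with non-zero magnitude contribute a direction.
    directions = [d for m, d in points if m != 0]
    direction = sum(directions) / len(directions)
    error = math.sqrt(sum((d - direction) ** 2 for d in directions))

    # Intercept: index (0..3, row-major) of the strongest central point.
    intercept = max(range(4), key=lambda k: central[k][0])
    return magnitude, direction, error, intercept
```

A straight edge with identical directions yields an error of zero,
while a neighborhood containing a right-angle corner yields a large
error, which is the behavior the error term is meant to signal.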

Note that more than one edge might pass
through a neighborhood. In this case, it can be
expected that the error term will be large, and
the reconstruction less reliable. Reconstructing
a level from its successor is also a simple process,
although sophisticated and complex methods could
be devised that would produce improved reconstructions.
The method that was implemented makes no
use of the error term, and does not require adjacent
neighborhoods in the reconstructed image to
be consistent. The process simply expands each
point to a 4 by 4 neighborhood, and assigns the
mean magnitude and direction stored at the point
to each edge point in the expanded neighborhood.
Points are chosen as edge points by requiring the
edge to pass through the intercept point, and to
lie along the assigned direction. The reconstructions
that result are usually very reasonable,
with the errors occurring only where there are
sharp corners, or where edges are close together.
The results can often be improved by applying the
best edge routine discussed above.
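
The expansion step might look like the sketch below. Several details
are assumptions: the intercept is taken to be an index 0 to 3 into the
central 2 by 2 region in row-major order, the stored direction is taken
to be the direction along the edge, measured counter-clockwise from the
positive x axis, and the line is rasterized by simple half-pixel
stepping.

```python
import math


def expand_point(magnitude, direction, intercept):
    """Expand one pyramid point into a 4 by 4 block of (magnitude, direction)."""
    block = [[(0.0, 0.0)] * 4 for _ in range(4)]
    # Position of the intercept pixel inside the 4 by 4 block.
    row0, col0 = 1 + intercept // 2, 1 + intercept % 2
    # Image rows grow downward, hence the minus sign on the sine term.
    d_col = math.cos(math.radians(direction))
    d_row = -math.sin(math.radians(direction))
    # Walk along the line in half-pixel steps and mark every pixel it hits.
    for step in range(-8, 9):
        r = round(row0 + 0.5 * step * d_row)
        c = round(col0 + 0.5 * step * d_col)
        if 0 <= r < 4 and 0 <= c < 4:
            block[r][c] = (magnitude, direction)
    return block
```

With a direction of 0 degrees the whole row through the intercept is
marked, and with 90 degrees the whole column, each point receiving the
stored mean magnitude and direction.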

Examples 

The edge images for the examples presented
below were obtained by applying a zero-crossing edge
detector (Marr & Hildreth, [11]) to grey level
images. The edge detector returns magnitude and
direction values for points corresponding to zero
crossings in the Laplacian or second directional
derivative of the image intensity. The zero
crossings are approximated by the zero points of
the difference between two Gaussian-like functions
with different standard deviations. The Laplacian
is calculated using the hierarchical discrete
correlation method of Burt ([3]). For each zero
crossing point, a 5 by 5 Prewitt-like operator is
used to give magnitude and direction information.
The advantages of using a zero-crossing detector
are that the edges are thin and the boundaries are
closed curves. The edge pyramid process will,
however, work for any edge or curve image.

The  first  example  shows  the  entire  process  in 
a  step  by  step  way,  using  a  binary  image  of  a 
square. For most images, only the magnitude values
produce meaningful pictorial displays, and only
these images will be shown in later examples. In
all  the  examples,  the  results  of  a  16-fold  com¬ 
pression  and  a  subsequent  reconstruction  are  shown. 

Figure 3a shows a binary image of a bright
square of size 32 by 32 centered in a 64 by 64
image. Figure 3b shows the result of applying the
zero-crossing edge detector to the image, while
Figure 3c shows the image resulting from one iteration
of the best edge procedure. In both cases,
the top image is the direction image, and the
bottom image is the magnitude image, both
thresholded so that all non-zero points are displayed.
After the best edge procedure has been applied, the
directions of neighboring points are more consistent.
This is the reason for the slight lengthening of the
two short line segments at the bottom of the second
direction image.

Figure 3d shows the four 32 by 32 images produced
by the edge pyramid program. The images in
the bottom row are the magnitude (left) and the
intercept (right) of each point. Those in the top
row are the direction image (left) and the error
in direction (right). All images are thresholded
so that non-zero points are displayed. The
important thing to notice here is the error image.
The only places at which errors in direction are
detected are the corners.

Figure 3e shows the results of applying one
iteration of the best edge procedure to the 32 by
32 images, and Figure 3f shows the 16 by 16 images
produced by the pyramid process. The reconstruction
algorithm is applied to the 16 by 16 images in
Figure 3g, and to the reconstructed 32 by 32 images
in Figure 3h. The results of the reconstruction
illustrate the ability of the edge pyramid to
retain full information in regions where there are
long consistent lines. It is only at the corners
that information is lost, and, in this case, a
simple extension algorithm could be applied to
restore the corners.

Figure 4 shows the results of running the
whole process on the edges produced from a gray
level image of a tank (Figure 4a). The original edge
magnitude and the enhanced magnitudes are shown in
Figure 4b. Figure 4c shows the first level of the
pyramid, the best edges found at this level, and
the second level of the pyramid. The first level
of reconstruction and the final reconstructed
image are shown in Figure 4d. Figure 4e shows the
result of thresholding the magnitude images, all at
the same threshold value. It is clear that the
important information has been retained. A similar
sequence is shown in Figure 5, starting from a
gray level image of part of an airport.

Figure  6,  finally,  shows  what  happens  when  an 
image  has  sharp  corners  and  inconsistent  edges 
that  occur  close  together.  While  the  result  of 
running  the  process  is  still  clearly  recognizable, 
the edges have been broken up into very short
segments at the corners, and the reconstructed
image is of clearly inferior quality to the
original.

EDGE  QUADTREES 

A quadtree is obtained from a binary image
by successive subdivision into quadrants. If the
original image is homogeneous, a single leaf node
is created. Otherwise, the image is divided into
four quadrants, which become sons of the root node.
This process is applied recursively until all
terminal nodes are homogeneous. Binary quadtrees
have been shown to be useful in representing large
images compactly, and many algorithms have been
developed for efficiently applying image processing
techniques to images represented by quadtrees
([2], [5], [6], [9], [10], [19-29]).

For gray level images, a class of quadtrees
can be defined, based on the brightness
characteristics of the image. The root node represents
the whole image, and typically stores the average
gray level. If the image is sufficiently
homogeneous (i.e., if the variance in gray level is
not too great) no subdivision is performed.
Otherwise, the image is divided into four subimages,
and four children of the root are constructed. As
long as the variance in any quadrant is higher than
a threshold, the process is recursively applied.
The result is a tree that represents the image to
a degree of accuracy dependent on the threshold.
By changing the threshold on the variance, a whole
class of gray level quadtrees can be constructed.
More generally, a class of quadtrees can be
constructed using piecewise polynomial fits to the
data in each quadrant. Gray level quadtrees are
useful for image smoothing (Ranade & Shneier,
[17]), shape approximation (Ranade et al., [15]),
and segmentation (Ranade, [16]; Wu et al., [33]).
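
The variance-driven subdivision can be sketched as a short recursive
routine. The dictionary-based node layout, the function names, and the
restriction to square images whose side is a power of two are
assumptions of this example.

```python
def variance(values):
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / len(values)


def build_quadtree(image, row, col, size, threshold):
    """Return a nested dict for the size-by-size block at (row, col).

    image is a square list of lists whose side is a power of two.
    """
    values = [image[r][c]
              for r in range(row, row + size)
              for c in range(col, col + size)]
    node = {"mean": sum(values) / len(values), "size": size, "children": []}
    # Subdivide only while the block is not homogeneous enough.
    if size > 1 and variance(values) > threshold:
        half = size // 2
        # Children in NW, NE, SW, SE order.
        for dr, dc in ((0, 0), (0, half), (half, 0), (half, half)):
            node["children"].append(
                build_quadtree(image, row + dr, col + dc, half, threshold))
    return node


def count_leaves(node):
    if not node["children"]:
        return 1
    return sum(count_leaves(child) for child in node["children"])


image = [[100, 100, 0, 0],
         [100, 100, 0, 0],
         [0, 0, 0, 0],
         [0, 0, 0, 0]]
fine = build_quadtree(image, 0, 0, 4, 10)
coarse = build_quadtree(image, 0, 0, 4, 2000)
```

With a low threshold the example image splits once into four
homogeneous quadrants; with a threshold above the root variance (1875
here) it is kept as a single leaf storing the average gray level, which
shows how the threshold selects one member of the class of trees.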


Edge quadtrees are similar to the trees
described above, except that the information stored at
each node includes a magnitude, a direction, an
intercept, and a directional error term. All the
information about the edge that is stored is used
in constructing the quadtree. As in the gray level
quadtrees, a class of edge quadtrees can be defined
based on the directional error term.

The construction of an edge quadtree proceeds
by first examining the magnitude term, then the
direction and direction error terms, and then the
intercept term. If a node has a sufficiently low
magnitude (i.e., no edge exists), then no
subdivision is performed. Similarly, if the error term is
sufficiently small, no subdivision is performed,
with one exception, as follows. It may happen that
a number of parallel edge segments run through a
quadrant, so that the direction term is consistent
with the data, and the error is low. Thus, a
division based only on the error would not be
performed. It is clear, however, that the quadrant
does not represent the data at a sufficient level of
detail to enable the set of parallel segments to
be reconstructed. Thus, whenever the error term
falls below threshold, a further check must be made
to ensure that the intercept points in the quadrant
are consistent (i.e., they lie along a line in the
direction of the direction term). Should this not
be the case, the quadrant must be subdivided. A
final requirement for an edge quadtree is a flag
that is turned on should an edge terminate within
a quadrant. In this case, the intercept will be
the point at which the edge terminates. The
result of applying this process recursively to the
image is a quadtree in which long edges that are
nearly straight give rise to large leaves, or a
succession of large leaves. Near corners, or
where edges intersect, much smaller leaves are
needed, perhaps as small as individual edge elements.
A feature of edge images is the low percentage of
points that contain interesting information. This
means that an edge quadtree can be expected to
contain many large leaves where there are no edges
at all.
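
The subdivision test, including the extra check for parallel segments,
can be sketched as follows. The function names, the threshold values,
the pixel tolerance, and the representation of intercepts as (x, y)
pairs in a conventional mathematical frame are hypothetical choices
made for the example.

```python
import math


def intercepts_collinear(intercepts, direction, tolerance=0.5):
    """True if all (x, y) intercepts lie on one line with the given direction.

    direction is in degrees; tolerance is the allowed perpendicular
    distance in pixels.
    """
    if len(intercepts) < 2:
        return True
    ux = math.cos(math.radians(direction))
    uy = math.sin(math.radians(direction))
    x0, y0 = intercepts[0]
    # The cross product with the unit direction is the perpendicular distance.
    return all(abs((x - x0) * uy - (y - y0) * ux) <= tolerance
               for x, y in intercepts)


def should_subdivide(magnitude, error, intercepts, direction,
                     min_magnitude=2.0, max_error=10.0):
    if magnitude < min_magnitude:
        return False            # no edge here, keep one large empty leaf
    if error > max_error:
        return True             # corner or crossing edges
    # Low error: still split if the segments are parallel but not collinear.
    return not intercepts_collinear(intercepts, direction)
```

Two horizontal segments four pixels apart have a low direction error
but fail the collinearity check, so the quadrant is subdivided, whereas
a single straight segment is kept as one leaf.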

Figure 7 shows the edge magnitude image for the
airplane picture of Figure 6, and the levels of
the edge quadtree that are not empty (levels 0, 1,
2, and 3). Note that the leaves are upright square
blocks in fixed positions and that the shape of
the quadtree is dependent on the global co-ordinate
system of the image. This is characteristic of all
quadtrees. It is one of the major differences
between this representation and the strip trees
proposed by Ballard [1] (Section 4).

One of the advantages of the edge quadtree is
that it can be used to represent an image that may
contain more than one curve. Of course, a separate
quadtree, perhaps smaller than the image quadtree,
can be used to represent each curve, but it is
preferable to use a single quadtree, both in order to
maintain registration with the image and for
compactness. Each curve in the tree can be named,
and all terminal nodes representing part of a curve
can be marked with the name of the curve. In
addition, region-like information can be made
available in the same structure, simplifying the
interactions between regions and linear features.
Notice that the quadtree for a closed curve and
that for the region enclosed by the curve have
closely related shapes.

If only one linear feature is represented by a
quadtree, many operations become very efficient.
If more than one curve is represented, however, some
of the operations need to be done at a higher
resolution, in order to ensure that the correct curves
are involved. A way of alleviating this problem is
to assign the names of curves passing through a
quadrant to non-terminal as well as terminal
nodes. A bound has to be put on the number of
names allowed, however, because this may not be
limited. All non-terminals that have more named
curve segments passing through them than the
bound allows can be flagged. When curve operations
involving flagged nodes are executed, the
descendants of the flagged nodes must be examined
recursively to find the first one with the required names.

Many operations that are useful for manipulating
edge and curve information can be implemented
efficiently using edge quadtrees. Most of the
algorithms are adaptations of quadtree algorithms for
region representation. Only the broad outlines and
necessary modifications will be given here.

First, an algorithm is presented for naming
each curve in a quadtree. It is based directly on
the algorithm for finding connected components of
an image represented by a quadtree (Samet, [21]).
First note that a leaf node can have no more than
nine different curves passing through it. This is
the number of "smooth" continuations through a
sequence of three pixels, assuming less than a 90
degree change of angle between pixels. Thus the
number of names at a leaf is bounded. (It is also
necessary to assume that two or more curves cannot
have arcs in common).

The algorithm involves three phases. The first
pass assigns names to each curve node in the
quadtree by starting at the North-West corner of the
tree, and examining the South and East neighbors of
the curve nodes. If the direction of a neighbor is
compatible with the node (i.e., within the error
tolerance) and its intercept is also compatible
(i.e., lines up along the common direction with the
node's intercept), the neighbor is given the same
name. Otherwise, a new name is assigned. If a node
is found that has already been named, and the node
is compatible, an equivalence is established between
the nodes.

The second phase processes the equivalent pairs
to produce equivalence classes, while the third
and final phase traverses the tree again, and
assigns a single name to all members of an
equivalence class. The names at the leaves of the tree
can be propagated up to the non-terminal nodes at
the same time, a nonterminal node being flagged if
too many names are assigned to it.

Many operations commonly applied to linear
data are facilitated by the quadtree representation.
For example, to find the length of a curve segment,
the tree is traversed starting at the root, and
looking at nodes until the first leaf node belonging
to the segment is found. The length contributed by
this node is calculated as the length of a line
through the intercept with direction given by the
direction component, and bounded by the node's
borders. To find the rest of the nodes in the curve,
the FIND-NEIGHBOR and FIND-CORNER algorithms defined
by Samet ([27]) are used. They are applied on each
side in the direction given by the node's direction
component. Nodes that are further away from the
leaf's direction than is allowed by the error
tolerance can be ignored. The lengths of the nodes
found in this way are added to the curve length, if
they have the correct name. Each new node after the
first will have at most one successor. When no more
neighbors can be found, the length has been
calculated. For closed curves, the original node must
be flagged to ensure that the process terminates.

Other operations are easily defined as
modifications of regular quadtree operations. The
distance from a point to a curve can be calculated
using a variant of the distance transform algorithm
(Samet, [25]). Instead of finding the distance
from a point (or the center of a BLACK node) to the
nearest WHITE, or boundary point, the distance to
the nearest curve point is found. Other algorithms,
like union and intersection (Hunter & Steiglitz [9],
Shneier [28]), require almost no alteration.

Interactions between region and boundary
information are natural in this representation because
of the registration of the images. Operations like
the Superslice algorithm (Milgram [12]) can be
performed on the quadtrees by taking advantage of the
information contained in the shapes of the trees.
The Superslice algorithm attempts to find the best
segmentation of an image by matching edge and region
information. A number of thresholds are applied
to the image, giving rise to various new images.
Each of these images is matched with the edge image,
and that with the best region/boundary fit is
chosen as the segmented image.

In the quadtree, this operation can be
simplified and made more intelligent. Starting with the
edge, and noting that the shape of the region
quadtree is constrained by that of the edge
quadtree, a range of candidate thresholds can be stored
at each leaf node in the quadtree, each candidate
giving rise to a region subtree with a shape
consistent with the edge subtree. If, after all the nodes
have had candidates assigned, there is a single
threshold that gives the correct shape (i.e., the
same threshold appears at all nodes), that
threshold will give the required segmentation.
Otherwise, a number of local thresholds may be applied to
subimages, or an approximation to the segmentation
can be made by choosing a compromise threshold.
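
The three-phase curve naming algorithm described in this section can be
sketched with a union-find structure for the equivalence classes. The
sketch is deliberately simplified and rests on assumptions: the leaves
are taken to be equal-sized cells on a flat grid rather than nodes of a
quadtree reached by neighbor finding, compatibility is judged on
direction alone (the intercept check is omitted), and the function
names and the 45 degree tolerance are hypothetical.

```python
def angle_difference(a, b):
    """Smallest absolute difference between two angles, in degrees."""
    d = abs(a - b) % 360
    return min(d, 360 - d)


def name_curves(leaves, tolerance=45):
    """leaves: dict mapping (row, col) to an edge direction in degrees.

    Returns a dict mapping each (row, col) to an integer curve name.
    """
    names = {}
    parent = {}                      # union-find forest over names

    def find(n):
        while parent[n] != n:
            n = parent[n]
        return n

    # Phase 1: scan from the North-West corner, looking South and East.
    for (r, c) in sorted(leaves):
        if (r, c) not in names:
            new = len(parent) + 1
            parent[new] = new
            names[(r, c)] = new
        for neighbor in ((r + 1, c), (r, c + 1)):
            if neighbor not in leaves:
                continue
            if angle_difference(leaves[(r, c)], leaves[neighbor]) > tolerance:
                continue
            if neighbor not in names:
                names[neighbor] = names[(r, c)]
            else:
                # Phase 2 bookkeeping: merge the two equivalence classes.
                a, b = find(names[(r, c)]), find(names[neighbor])
                parent[max(a, b)] = min(a, b)

    # Phase 3: replace every name by the representative of its class.
    return {cell: find(n) for cell, n in names.items()}
```

In the usage below, the first set of cells needs an equivalence to be
resolved before all three receive one name, while the second set
contains a right-angle corner and is therefore split into two names.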


COMPARISON WITH OTHER METHODS

Another hierarchical quadrant-based method
for representing edges is that of Omolayole and
Klinger ([13]). They recursively subdivide an
edge image into quadrants down to a 2 by 2 level.
A number of edge patterns are then sought in each
subquadrant, and, if too few of these are found,
the quadrant is discarded. The result is a kind
of tree structure, with the leaves containing
template-like representations of the edge data
in them. The main aim of this method seems to
be to discard the areas of the image that contain
little or no information. For edge images, this
can be expected to save fairly large amounts of
storage. The edge quadtree differs from this
approach in its treatment of quadrants containing
edge information. Instead of having fixed-sized
leaves, the quadtree allows leaves to be of the
largest size consistent with the edge information
they represent.

Other methods that have been devised for
representing linear information are the upright
rectangles of Burton ([4]), the strip trees of
Ballard ([1]), and the chain codes of Freeman
([18]).

Freeman has developed one of the most compact
and well-known boundary representations, called
chain codes. These codes represent the relative
grid positions of successive line points in a
digital image. They are perhaps not as well suited
to representing edge information as line
information, but have the advantages of being compact
and of not being tied to any particular co-ordinate
system.

Burton presented a method of representing
polygonal lines using a series of upright
rectangles. His work was extended by Ballard, who
defined a representation for curve information called
strip trees. A strip tree is a representation of
a curve, obtained by successively approximating
parts of the curve by enclosing rectangles. The
structure is a binary tree, with the root node
representing the bounding rectangle of the whole
curve. This rectangle is broken into two parts
at a point of maximum distance from the line
joining the endpoints of the curve. The two parts are
children of the root, and may be recursively
subdivided until an error bound is satisfied. Note
that the strip tree is not unique in cases where
more than one extreme point exists.

All these other representations are able to
represent only single curves, while the edge
quadtree representation is able to represent several
curves in the same tree. The edge quadtrees and
edge pyramids are also in registration with the
image, and with region-based representations like
ordinary quadtrees and pyramids. This gives them
a further advantage over the other representations.
Where the other methods, and particularly the
chain code, gain over the edge quadtree is in
compactness, although the edge quadtree is actually
storing more information than the other methods,
and may give rise to better reconstruction of edge
information.


DISCUSSION

There are two main reasons for developing
pyramids for linear features. The first is to
provide compression of the data, for example to allow
linear features to be detected from low resolution
images in a hierarchical image data base, thus
reducing the number of full resolution images that
need to be examined. The second reason is to
enable the most prominent edge features (or the
edge features larger than a given size) to be
extracted from the image, and to discard smaller
features.

It would be impractical to search a large image
data base for the existence of an object with a
set of known features. Rather, it would be useful
to be able to filter out most of the images on the
basis of gross tests, leaving only a few to be
examined more closely. For region- or blob-like
features, this ability can be provided by grey-level
pyramids or quadtrees. While the most natural form
for storing linear information is probably a linked
list, it is desirable for uniformity to store
linear feature information in a similar way to regional
feature information. This facility is provided
by the edge pyramids and edge quadtrees presented
in this paper. Of course the representations are
useful not only for edges, but for any linear
information.

The second reason for building an edge pyramid
is to enable noisy edges and edges that are too
small to be filtered out of the image. In fact,
this is achieved in two ways, both through the
lateral best edge process and through the pyramid
process. The best edge process is not designed
specifically to enhance good edges and suppress
bad ones, but it has this effect except where two
edges intersect or two edges pass very close to
each other. The pyramid process causes short
segments to be lost high in the pyramid because, after
a while, they fail to find good continuations. Note,
though, that short broken edge segments could be
joined together if the gaps were sufficiently small
relative to the image resolution. The main
requirements for an edge to continue to exist at
successively higher levels in the pyramid are
consistency and good continuation.

CONCLUSIONS 

Two related representations for linear
information have been presented. Edge pyramids have
been shown to be able to store the important edge
information in an image, even at fairly low
resolution, with the ability to reconstruct images
that look very much like the originals. The main
loss in information is at intersections of edges,
or where edges pass close to each other. Small
edges, usually representing noise, are also lost.

Edge quadtrees have been presented as an
alternative hierarchical representation for linear
feature information, with the ability to represent
the information at variable resolution, depending
on the local consistency of the data. The
advantages of edge quadtrees over other representations
are their ability to represent more than one
curve in a single structure, and their registration
with the image and with other region-based
representations for images.

Figure 1. Two neighborhoods used in constructing
pyramids. The central 2 by 2 regions
are disjoint, but each neighborhood
shares rows or columns with its neighbors.

Figure 2. The ways in which edge points may be
extended. a. For a two-point edge,
there are six consistent continuations.
b. For a one-point edge, there are five
consistent continuations.

Figure 3. The pyramid process applied to a 64 by 64
binary image. a. Original image of a
square region. b. The magnitude (bottom)
and direction (top) images produced by a
zero-crossing edge detector. c. The magnitude
and direction images after running
the best edge procedure. d. The first
pyramid level (32 by 32): magnitude
(bottom left); intercept (bottom right);
direction (top left); error (top right).
e. The result of applying the best edge
procedure to the 32 by 32 images. f. The
second pyramid level (16 by 16). g. The
result of reconstructing a 32 by 32 image
from the 16 by 16 pyramid level. h. The
result of constructing a 64 by 64 image
from the 32 by 32 reconstructed image.


Figure 4. The pyramid process applied to a gray
level image. a. A FLIR image of a tank.
b. The magnitude and enhanced magnitude
of the edge image of the tank (thresholded
so that non-zero points are displayed).
c. The first pyramid level
magnitude, the enhanced magnitude, and
the second pyramid level magnitude. d.
The first and second level reconstructed
images. e. The same process as in a and
b above, but with all images thresholded
at the same level.


Figure 5. The process described in Figure 4 applied
to an image of part of an airfield.

Figure 6. The pyramid and reconstruction process
applied to a binary image of an airplane.

Figure 7. The edge quadtree of an airplane image.
a. The edge magnitude image. b. The
lowest level (level 0) of the edge quadtree
(individual pixels). c. Level 1,
having 2 by 2 blocks of pixels. d. Level
2, having 4 by 4 blocks. e. Level 3,
having 8 by 8 blocks.

ABSTRACT

A method of evaluating edge detector output
is proposed, based on the local good form
of the detected edges. It combines two desirable
qualities of well-formed edges: good
continuation and thinness. The measure has the
expected behavior for known input edges as a
function of their blur and noise. It yields
results generally similar to those obtained with
measures based on discrepancy of the detected
edges from their known ideal positions, but it
has the advantage of not requiring the ideal
positions to be known. It can be used as an aid to
threshold selection in edge detection (pick the
threshold that maximizes the measure), as a basis
for comparing the performances of different
detectors, and as a measure of the effectiveness
of various types of preprocessing operations
facilitating edge detection.


INTRODUCTION 


The concept of an edge is a difficult one
to define precisely. The stimulus conditions
that cause the perception of an edge by humans are
by no means simple to describe. There are many
well-known visual paradoxes in which an edge is
clearly seen where none physically exists. (See,
for example, Cornsweet [1974], Dember [1966], or
Gregory [1974].) In the analysis of images by
computer, exactly what constitutes an edge depends
greatly on the objectives of the analysis.

Keeping the above in mind, we can nonetheless
regard an edge as the boundary between two adjacent
regions in an image, each region homogeneous
within itself, but differing from the other with
respect to some given local property. Thus an
edge should ideally be line-like.

In this paper we restrict our attention to
the simplest case, brightness edges, although the
edge evaluation techniques we present below are
applicable to color or texture edges as well.
Brightness edges in an image have many possible
causes in the original scene: discontinuities in
surface properties (such as reflectance), in surface
orientation, in illumination (shadows, for
example) or in depth (causing occlusion of one
surface by another). However, the interpretation
of the cause of an edge will not concern us here.

Brightness edges (henceforth just edges) are
important features in image analysis, and, accordingly,
many schemes have been devised for detecting them.
Here we are concerned chiefly with so-called
enhancement/thresholding edge detectors. In the
enhancement step, an operator which computes local
brightness differences is applied to an image.
Such an operator will have a high response when
positioned on the boundary between two regions, but
little or no response within each region. (The
operators discussed below also compute an estimate
of the direction of brightness change.) In the
next step, the edges in the image are extracted by
suitably thresholding the operator output. The
final result of processing is a binary picture,
with pixels deemed to be on an edge (edge pixels) having
the value 1, and all others (non-edge pixels) having the
value 0.
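
As a rough illustration of the two steps just described, the following Python sketch applies a simple difference operator and then thresholds its output. The central-difference operator, the function names `enhance` and `threshold`, and the threshold value are assumptions chosen for brevity; this is not one of the operators evaluated in this paper.

```python
import numpy as np

def enhance(image):
    """Enhancement step (illustrative): central differences in x and y."""
    img = np.asarray(image, dtype=float)
    gx = np.zeros_like(img)
    gy = np.zeros_like(img)
    gx[:, 1:-1] = img[:, 2:] - img[:, :-2]   # horizontal brightness change
    gy[1:-1, :] = img[2:, :] - img[:-2, :]   # vertical brightness change
    magnitude = np.hypot(gx, gy)             # strength of the local change
    direction = np.arctan2(gy, gx)           # estimated direction of change
    return magnitude, direction

def threshold(magnitude, t):
    """Thresholding step: 1 for edge pixels, 0 for non-edge pixels."""
    return (magnitude > t).astype(np.uint8)

# A small vertical step edge between grey levels 115 and 140.
image = np.full((8, 8), 115.0)
image[:, 4:] = 140.0
magnitude, direction = enhance(image)
edges = threshold(magnitude, 10.0)
```

With this operator the step is marked in the two columns adjacent to the boundary, which already hints at why thinness matters when judging detector output.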

It is of interest to evaluate the quality of
the output of an edge detector, both to compare one
detector scheme with another, and also to study
the behavior of a given detector under different
conditions and parameter settings. Several authors
have proposed techniques for edge evaluation. In
the next section we review their work.


SURVEY  OF  PREVIOUS  WORK 


Fram and Deutsch [1974, 1975] studied the
effect of noise on various edge detector schemes.
For this purpose they used synthetic images composed
of three vertical panels. The outer panels were of
two different grey levels; the narrow inner panel
interpolated between these grey levels. It was
considered that the position of the edge was defined
by this central panel and only here should an edge
detector respond. Images were generated for a
number of different levels of contrast between the
two outer panels, and to each image was added
identically distributed zero-mean Gaussian noise.

Several different edge enhancement techniques
were applied to these test images, and thresholds
were chosen so that the number of detected edge
points was as close as possible to the number of
points expected for a well-formed edge, based on
inspection of a sample of detector outputs. The
thresholded output was evaluated according to two
measures. The first, $P_1$, estimated what fraction
of the detected edge pixels were actually edge
points. The second, $P_2$, estimated what fraction
of the vertical extent of the edge was covered by
detected edge pixels. These estimates are possible
because it was known that edge pixels actually due
to the edge could be found only in the central
panel, and it was assumed that edge pixels due to
noise would be uniformly distributed throughout
the image.

As would be expected, the experimental results
showed that edge detector performance, as measured
by $P_1$ and $P_2$, improves when contrast is increased
relative to noise. They also demonstrated that
some edge detection schemes perform consistently
better than others.

While their measures are directly applicable
only to vertical edges, Fram and Deutsch also
performed experiments with synthetic oblique
edges. They did this by the expedient of numerically
rotating the enhancement output until it
corresponded to a vertical edge. It could then
be thresholded and evaluated as if it had been
vertical. By this means they examined the sensitivity
of the detectors they used to edge orientation.

The approach of Abdou and Pratt [1979] is
more analytic. (See also Abdou [1978].) Using a
simple model for the digitization of a straight
edge passing through the center of an operator's
domain, they geometrically analyzed the sensitivity
of a number of edge enhancement operators to the
orientation of the edge. They similarly analyzed
the fall-off of operator response with displacement
from the center of the domain for straight
edges with vertical and diagonal orientations.

They described a statistical design procedure
for threshold selection in noisy images with
vertical and diagonal edges. Using additive
Gaussian noise as an example, they derived the
conditional probability distributions of operator
response for a number of enhancement operators,
given the existence or non-existence of an actual
edge. They could thus compute for each operator
the probabilities of correct and false detection
as a function of threshold and of noise level.
By this means they showed the superiority of some
detection schemes over others. They also presented
a pattern-classification approach to threshold
selection using training samples of edge and no-edge
neighborhoods, and gave experimental results for a
number of edge detectors in discriminating edge from
non-edge neighborhoods using this approach. These
results show a similar ordering of the quality of
the various edge detection schemes.


More relevant to the present paper, Abdou and
Pratt provided another experimental comparison of
the various edge detector schemes using Pratt's
figure of merit of edge quality [Pratt 1978].
They used synthetic test images very similar to
those of Fram and Deutsch above. The only difference
worth remarking on is that Abdou and Pratt vary the
relative strength of signal to noise by holding the
contrast constant and changing the standard deviation
of the added noise. Pratt's figure of merit
is based on the displacement of each detected edge
pixel from its ideal position (known from the geometry
of the synthetic image), with a normalization
factor to penalize for too few or too many edge
points being detected. Its definition is:

$$F = \frac{1}{\max(I_I, I_A)} \sum_{i=1}^{I_A} \frac{1}{1 + a\,d^2(i)}$$

where $I_A$ is the actual number of edge pixels detected;
$I_I$ is the ideal number of edge pixels
expected (known from the geometry of the synthetic
image); $d(i)$ is the miss distance of the $i$th edge
pixel detected; and $a$ is a scaling factor to provide
a relative weighting between smeared edges and
thin but offset edges. For these experiments,
Abdou and Pratt set $a = 1/9$. Like Fram and Deutsch's
parameters $P_1$ and $P_2$, this figure of merit was
implemented for vertical edges, but Abdou and Pratt
also present a modification of it for diagonal
edges.
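
The figure of merit is easy to compute for a vertical test edge. The sketch below is a minimal illustration, assuming the usual form of Pratt's measure, $F = \frac{1}{\max(I_I, I_A)} \sum_i 1/(1 + a\,d^2(i))$, with one ideal edge pixel per image row and miss distance measured horizontally. The function name `pratt_fom`, its arguments, and the convention of returning 0 for an empty edge map are assumptions made here, not part of the original work.

```python
import numpy as np

def pratt_fom(edge_map, ideal_col, a=1.0 / 9.0):
    """Pratt-style figure of merit for a vertical ideal edge at ideal_col."""
    edge_map = np.asarray(edge_map)
    _, cols = np.nonzero(edge_map)          # columns of detected edge pixels
    i_actual = cols.size                    # I_A: pixels actually detected
    i_ideal = edge_map.shape[0]             # I_I: assumed one per row
    if i_actual == 0:
        return 0.0                          # assumed convention: nothing found
    d = np.abs(cols - ideal_col)            # miss distance of each pixel
    return float(np.sum(1.0 / (1.0 + a * d * d)) / max(i_ideal, i_actual))

perfect = np.zeros((8, 8), dtype=np.uint8)
perfect[:, 4] = 1                           # thin edge in the right place
offset = np.zeros((8, 8), dtype=np.uint8)
offset[:, 7] = 1                            # thin edge, three pixels off
smeared = np.zeros((8, 8), dtype=np.uint8)
smeared[:, 3:6] = 1                         # three-pixel-thick edge

f_perfect = pratt_fom(perfect, ideal_col=4)
f_offset = pratt_fom(offset, ideal_col=4)
f_smeared = pratt_fom(smeared, ideal_col=4)
```

With $a = 1/9$ a thin edge displaced by three pixels scores lower than a centered edge smeared over three columns, which is the relative weighting the scaling factor is meant to control.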

Unlike Fram and Deutsch, Abdou and Pratt used
the less arbitrary procedure of choosing thresholds
so as to maximize the figure of merit. The experimental
results showed, as one would expect, that
the figure of merit declines with increased noise,
and also again showed the superiority of certain
edge detection schemes over others.

The work of Bryant and Bouldin [1979] is
different in several respects. They used real
aerial photographs instead of synthetic images.
Their threshold selection was based on accepting
a fixed upper percentile of the distribution of
enhanced edge output. More significantly, they
proposed two quite distinct edge evaluation measures.
One, called absolute grading, is based on
the correlation of the edge detector output with
an ideal "key" output, this key being determined
apparently by hand. Their other technique, called
relative grading, is rather novel. Omitting the
details, it is based on comparing the output of a
number of detectors, and rating each detector by
how often it agrees with the consensus of the other
detectors in deciding whether an edge exists at
each pixel. By these means they compared a number
of edge detectors, and were able to some extent to
quantify the improvement in edge output achieved by
such post-processing as edge-linking and edge-thinning.
They also gave an example of the effect of
threshold level on the absolute grade of an edge
detector.

Relative grading, while an interesting idea,
suffers from a number of weaknesses. Its results
depend on the details of the consensus determination
used, and on the particular mix of operators
chosen for comparison. Most important, it is
completely oblivious to detection errors made by
all detectors, and may even penalize a good
detector that does not make an error made by a
majority of bad detectors.
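
The last weakness can be seen in a toy example. The sketch below is only a hypothetical stand-in for relative grading: it assumes the consensus at each pixel is a strict majority vote of the other detectors and grades a detector by the fraction of pixels on which it agrees with that consensus. Bryant and Bouldin's actual consensus rule is not given here, and the name `relative_grades` is invented for illustration.

```python
import numpy as np

def relative_grades(edge_maps):
    """Grade each detector by agreement with a majority vote of the others."""
    maps = np.asarray(edge_maps, dtype=bool)     # (n_detectors, rows, cols)
    n = maps.shape[0]
    votes = maps.sum(axis=0)                     # edge votes per pixel
    grades = []
    for k in range(n):
        others = votes - maps[k]                 # votes cast by the others
        consensus = 2 * others > (n - 1)         # strict majority (assumed)
        grades.append(float(np.mean(maps[k] == consensus)))
    return grades

truth = np.zeros((4, 4), dtype=bool)
truth[:, 2] = True                               # a real vertical edge
good = truth.copy()                              # finds the edge exactly
bad = np.zeros((4, 4), dtype=bool)               # misses the edge entirely
grades = relative_grades([good, bad, bad])
```

Here the two detectors that miss the edge agree with each other and receive perfect grades, while the detector that finds it is marked down on every true edge pixel.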

Aside from relative grading, all methods
discussed above require prior knowledge of the
location of the actual edge, since they are more or
less based upon the discrepancy between the detected
edge pixels and the ideal position of the edge.
This is fine for experiments with controlled synthetic
images, but raises questions when applied
to real images, since the determination of edges
in such pictures is very much the subjective
decision of a human observer. The techniques are
completely inapplicable to images for which the
actual edge locations are unknown.

Further, the discrepancy between detected and
ideal edge is not the sole determinant of the
quality of edge output. See Figure 1. Here we
have two detected edges, both of equal discrepancy
from the ideal. However, one of them is
clearly preferable, since the detected edge is
continuous, rather than fragmented. It is clear
that some attention should be paid to the good
form of the detected edge.

Finally, none of the above edge evaluation
measures takes any account of the edge direction
information produced by most edge enhancement
operators. This information is used in many
applications and is an important consideration
in determining the good form of an edge. Even
though a set of edge pixels may lie in the shape
of a well-formed edge, something is amiss if the
estimated edge directions are chaotic. Ideally,
the brightness gradient direction should be
everywhere perpendicular to the edge, and
perpendicular in the same sense.


LOCAL  EDGE  COHERENCE 


Bearing in mind the deficiencies of the above
techniques, we have developed an edge evaluation
measure based solely on the criterion of good
edge formation, without using any prior knowledge
of ideal edge location. This new measure is
intended as a supplement to existing measures, not
a replacement, since it is clear that a measure
which disregards the correct location of an edge
cannot be a fully adequate measure (although the
results presented below show that it is quite good).
For example, an edge detector that systematically
mislocates edges will, by our scheme, receive an
evaluation measure equal to that of a detector
which perfectly locates edges.

However, since the new measure does not require
prior knowledge of edge location, it can be used
much more freely, in particular on images for which
this knowledge is lacking. In addition to the
standard uses of comparing edge detector schemes,
the new measure can be used for selecting and
adjusting edge operators as they are applied to an
actual image. For example, an edge detector threshold
can be chosen so as to maximize the edge evaluation
measure. This will be the threshold which
extracts the best-formed edges. (This parallels
the work of Weszka and Rosenfeld [1978] on threshold
evaluation for segmentation of regions. One of
their techniques rated a threshold level on the
basis of the busyness of the resulting thresholded
image.) In applications where edge extraction is
an important part of the processing, the edge
evaluation measure can serve as an indication of
image quality.


The approach we have used is based on what we
call local edge coherence. Essentially, we examine
every three by three neighborhood of the thresholded
edge output, taking into account the direction
output as well. If the center of the neighborhood
is an edge pixel, then we call the neighborhood an
edge neighborhood and rate it on the basis of two
criteria, continuation and thinness, which should
both be exhibited by a well-formed edge passing
through the center of the neighborhood. Both these
criteria are based on the working definition of an
edge given in Section 1. It should be locally
line-like, with due regard for the consistency of direction
of brightness changes. Continuation requires,
ideally, that adjacent to the central pixel, along
the edge (this is perpendicular to the gradient
direction of brightness change at the center), there
be two edge pixels with almost identical direction
which form the continuation of that edge. Thinness
requires, ideally, that all the other six pixels
of the neighborhood be non-edge pixels. The continuation
and thinness ratings of an entire edge output
can be measured as the fraction of edge neighborhoods
satisfying these respective criteria.

Of course, for most images, very few edge
neighborhoods will perfectly satisfy these two
criteria, because of digitization problems and
even slight noise. We therefore compute instead
continuation and thinness scores, ranging from 0 to
1, with the overall scores being averaged over
every edge neighborhood in the output. These
scores are designed to take the value 1 for
perfectly-formed edge neighborhoods, dropping off
only slightly for almost well-formed neighborhoods,
but falling eventually to 0 for badly formed
neighborhoods.

The continuation score is computed as follows:

Let $|\alpha - \beta|$ represent the absolute difference
between two angles $\alpha$ and $\beta$, the difference ranging
from 0 to $\pi$ radians. Let

$$a(\alpha, \beta) = \frac{\pi - |\alpha - \beta|}{\pi}$$

This function ranges from 1 for identical angles
$\alpha$ and $\beta$, linearly down to 0 for angles that differ
by half a revolution, that is, point in opposite
directions. It thus measures the extent to which
the two angles agree in direction.

Let us number the neighbors of an edge pixel
as shown in Figure 2. Let $d$ stand for the edge
gradient direction at the center pixel, and let
$d_0, d_1, \ldots, d_7$ stand for the edge gradient directions
at each of the eight neighbors respectively.
Let

$$L(k) = \begin{cases} a(d, d_k)\, a(\pi k/4,\; d + \pi/2) & \text{if neighbor } k \text{ is an edge pixel} \\ 0 & \text{otherwise.} \end{cases}$$

This function measures how well a neighboring
pixel continues on the left of the edge which
passes through the central pixel. It is 0 when
the neighbor is not an edge pixel, since no
continuation exists. When the neighbor is an
edge pixel, its rating is composed of two factors.
The first, $a(d, d_k)$, measures how well the edge
gradient direction at the neighbor agrees with
that at the center. The second factor,
$a(\pi k/4, d + \pi/2)$,
measures how close neighbor $k$ is to the expected
direction of leftward continuation of the edge,
based on the direction at the center. The term
$\pi k/4$ is the direction to neighbor $k$, and the term
$d + \pi/2$ is at right angles to the gradient direction
and therefore lies along the edge. Similarly we
define

$$R(k) = \begin{cases} a(d, d_k)\, a(\pi k/4,\; d - \pi/2) & \text{if neighbor } k \text{ is an edge pixel} \\ 0 & \text{otherwise} \end{cases}$$

which measures how well neighbor $k$ continues the
edge toward the right.

Of the three neighboring pixels lying to the
left of the central edge gradient direction, the
one with the highest value of $L(k)$ is taken as the
left continuation. Similarly, of the three pixels
on the right, the one with the best value for $R(k)$
is taken as the right continuation of the edge.
The average of these two best continuations is
taken as the continuation measure $C$ for the entire
neighborhood.

The thinness measure $T$ for the neighborhood is
computed as that fraction of the remaining six
pixels of the neighborhood which are non-edge pixels.
This will range from 1 for a perfectly thin edge,
down to 0 for a very blurred edge.

Neither of these measures is independently useful
for edge evaluation, as will be explained below.
However, a linear combination of the two,

$$E = \gamma C + (1 - \gamma) T$$

serves quite well for suitable values of $\gamma$. This
parameter $\gamma$ can be adjusted to give a relative
biasing of the measure $E$ in favor of well-connected
edges as against thin edges. The choice of $\gamma$ will
also be discussed below.

While this approach to edge evaluation is a
little ad hoc, no simpler technique seemed able to
capture the notion of a locally well-formed edge.
We were first led to investigate the possibility of
an edge evaluation measure based on good form by an
observation on compatibility coefficients for relaxation
labelling [Peleg and Rosenfeld, 1977]. The
arrays of compatibility coefficients showed a particular
diagonal tendency when derived from images
with clear edges, which was far less pronounced when
derived from noisy or blurred images. We attempted
to devise an edge evaluation measure based on
characteristics of the compatibility coefficient
arrays, and later on characteristics of the edge
direction co-occurrence matrices, which are closely
related. Preliminary experiments showed that none
of these measures were satisfactory, although they
suggested that a measure based on good form could
ultimately be developed. Several techniques based
on local properties of the edge output were investigated,
culminating in the method presented here.
This measure is intuitively reasonable and, more
important, performs quite well, as the experimental
results below demonstrate.


One defect of our measure (though shared by
all others) is that it can only be applied after
thresholding. We endeavored to remedy this by
devising methods that treat all pixels as potential
edge pixels, but weigh their contributions by a
function of their edge magnitudes. Unfortunately,
the enormous number of low-magnitude pixels
distorts the measure, unless the weighting function
is of such a form as to be tantamount to thresholding.

It should be pointed out that this approach
can be easily adapted to measuring the good form of
other features, such as lines or corners, which are
normally detected by some sort of template matching.
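
The local edge coherence measure of this section can be sketched in a few lines of Python. The code below is an illustrative reading of the description, not the authors' program. It assumes that neighbor $k$ lies in direction $\pi k/4$, numbered counterclockwise from the right-hand neighbor with angles measured as if the vertical axis pointed up; that the "three pixels on the left" are the three neighbors nearest the direction $d + \pi/2$; that an empty continuation slot still counts as one of the two excluded pixels; and that only edge pixels with a complete 3 by 3 neighborhood are scored. All function names are invented for the example.

```python
import math
import numpy as np

# (row, col) offset of neighbor k; rows grow downward, so "up" is row - 1.
OFFSETS = [(0, 1), (-1, 1), (-1, 0), (-1, -1),
           (0, -1), (1, -1), (1, 0), (1, 1)]

def agreement(alpha, beta):
    """a(alpha, beta): 1 for equal angles, falling linearly to 0 at pi."""
    diff = abs(alpha - beta) % (2 * math.pi)
    if diff > math.pi:
        diff = 2 * math.pi - diff
    return (math.pi - diff) / math.pi

def continuation(edges, direction, r, c, side):
    """Best L(k) (side=+1) or R(k) (side=-1) and the neighbor achieving it."""
    d = direction[r, c]
    along = d + side * math.pi / 2               # direction along the edge
    k0 = int(round(along / (math.pi / 4))) % 8   # nearest neighbor slot
    best_k, best = k0, 0.0
    for k in (k0, (k0 - 1) % 8, (k0 + 1) % 8):
        rr, cc = r + OFFSETS[k][0], c + OFFSETS[k][1]
        if edges[rr, cc]:
            score = (agreement(d, direction[rr, cc])
                     * agreement(k * math.pi / 4, along))
            if score > best:
                best_k, best = k, score
    return best_k, best

def neighborhood_measure(edges, direction, r, c, gamma=0.8):
    """E = gamma * C + (1 - gamma) * T for one edge neighborhood."""
    k_left, left = continuation(edges, direction, r, c, +1)
    k_right, right = continuation(edges, direction, r, c, -1)
    C = (left + right) / 2.0
    rest = [k for k in range(8) if k not in (k_left, k_right)]
    empty = sum(1 for k in rest
                if not edges[r + OFFSETS[k][0], c + OFFSETS[k][1]])
    T = empty / len(rest)                        # fraction of non-edge pixels
    return gamma * C + (1.0 - gamma) * T

def edge_coherence(edges, direction, gamma=0.8):
    """Average E over all interior edge neighborhoods (0 if there are none)."""
    rows, cols = edges.shape
    scores = [neighborhood_measure(edges, direction, r, c, gamma)
              for r in range(1, rows - 1)
              for c in range(1, cols - 1) if edges[r, c]]
    return sum(scores) / len(scores) if scores else 0.0

# A thin vertical edge whose gradient points to the right everywhere.
edges = np.zeros((5, 5), dtype=bool)
edges[:, 2] = True
direction = np.zeros((5, 5))
score = edge_coherence(edges, direction)
```

Setting `gamma=0.0` in this sketch reduces the measure to thinness alone, under which a single isolated edge pixel receives the top score; this is one way to see why neither component is useful by itself.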


EXPERIMENTS 


We present here some experiments which investigate
the behavior of a number of edge detection
schemes under various conditions. To permit a
comparison, we have tried to make our experimental
setup as similar as possible to that of Abdou and
Pratt. We have used the same edge detection
schemes (although our measure also makes use of
edge direction information), the same noise model
(additive, independent zero-mean Gaussian noise),
the same threshold selection criterion (choosing
that threshold which maximizes the evaluation
measure), and for one series of experiments,
essentially the same test image.

Test Images and Edge Detectors Tested

Two test images of edges were used. The first,
64 by 64 pixels, consisted of a left panel with grey
level 115, a right panel with grey level 140, and a
single central column of intermediate grey level
126. This we will call the "vertical edge" image.
It is virtually the same as one of the test images
used by Abdou and Pratt. In order to present
conveniently edges at all orientations, we chose a
second test image consisting of concentric light
rings (grey level 140) on a dark background (grey
level 115). This image was originally generated as
a 512 by 512 image, with a central dark circle of
radius 64, surrounded by three bright rings of
width 32, these being separated by two dark rings
of the same width, with a dark surround. The
decision as to whether a pixel should be light or
dark was based on its Euclidean distance from the
center of the image. Then this image was reduced
to size 128 by 128, by replacing each 4 by 4 block
with a single pixel having the average grey level
of the block. The reduction gave a convenient
way of approximating, for curved edges, the
digitization model used by Abdou and Pratt. We call
this the "rings" image. While the edges in this
test image are curved, they are locally almost
straight, at all possible orientations.
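
The construction of the rings image can be reproduced along the following lines. This is a sketch under stated assumptions: the center is taken to lie midway between the four middle pixels, each ring includes its inner radius and excludes its outer radius, and the name `rings_image` is invented here. The radii, widths, grey levels and the 4 by 4 block averaging follow the description above.

```python
import numpy as np

def rings_image(size=512, factor=4, dark=115.0, light=140.0):
    """Concentric light rings on a dark background, reduced by averaging."""
    centre = (size - 1) / 2.0                    # assumed centre position
    y, x = np.mgrid[0:size, 0:size]
    radius = np.hypot(x - centre, y - centre)    # Euclidean distance
    # Rings are 32 wide starting at radius 64; even-numbered bands are light.
    band = np.floor((radius - 64.0) / 32.0).astype(int)
    is_light = (radius >= 64.0) & (radius < 224.0) & (band % 2 == 0)
    full = np.where(is_light, light, dark)
    # Replace each factor-by-factor block with its average grey level.
    n = size // factor
    return full.reshape(n, factor, n, factor).mean(axis=(1, 3))

small = rings_image()
```

Block averaging leaves intermediate grey levels along the ring boundaries, which is what approximates the digitization of a curved edge.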

To study the effects of noise, independent
zero-mean Gaussian noise was added to each of the
test images at seven different signal to noise
ratios: 1, 2, 5, 10, 20, 50 and 100. Following Pratt,
the signal to noise ratio (SNR) is defined to be

$$\mathrm{SNR} = \left(\frac{h}{\sigma}\right)^2$$

where $h$ is the edge contrast (in this case 25), and
$\sigma$ is the standard deviation of the noise, adjusted
to give the selected values of SNR. As an extreme
case, we used an additional 64 by 64 test image
with no well-formed edges, just Gaussian noise
with mean 128 and standard deviation 16.
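
For concreteness, the noise level implied by a given SNR can be computed as below. The sketch assumes the definition $\mathrm{SNR} = (h/\sigma)^2$, so that $\sigma = h/\sqrt{\mathrm{SNR}}$; the function names and the fixed random seed are illustrative choices only.

```python
import numpy as np

def noise_sigma(contrast, snr):
    """Standard deviation giving the requested SNR = (contrast / sigma)**2."""
    return contrast / np.sqrt(snr)

def add_noise(image, contrast, snr, rng):
    """Add independent zero-mean Gaussian noise at the requested SNR."""
    sigma = noise_sigma(contrast, snr)
    return image + rng.normal(0.0, sigma, size=image.shape)

rng = np.random.default_rng(0)                   # seed chosen arbitrarily
flat = np.full((256, 256), 115.0)
noisy = add_noise(flat, contrast=25.0, snr=100, rng=rng)
sigmas = {snr: noise_sigma(25.0, snr) for snr in (1, 2, 5, 10, 20, 50, 100)}
```

Under this definition an edge contrast of 25 corresponds to a noise standard deviation of 25 grey levels at SNR 1 and 2.5 grey levels at SNR 100.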

Figure 3 shows the vertical edge image, noise
free as well as with the various levels of added
noise. Figure 4 shows the same for the rings image.
At the higher signal to noise ratios, the noise is
almost imperceptible to the human eye. However, it
is quite significant to the edge detectors used,
since all of them have only small domains.

Ten different edge enhancement schemes were
tested. The first group are the so-called "differential"
operators. These measure the horizontal
and vertical components of the brightness change
by applying a pair of linear masks. The edge gradient
direction is computed from these two components
using the inverse tangent; and the edge gradient
magnitude is computed either as the square root
of the sum of squares of the two components, or as
a sum (or max) of absolute values, for computational
simplicity. Three different pairs of masks were
used: those defined by Prewitt, Sobel and Roberts.
Since the edge magnitude can be computed in two
ways, this gives six methods altogether. The second
group are the "template-matching" operators:
three-level, five-level, Kirsch and compass-gradient.
Each of these applies eight masks at every neighborhood.
The edge magnitude is taken to be the
strongest response out of these eight masks, and the
edge gradient direction is given by the preferred
orientation of the strongest-responding mask. For
details on and references to all these operators,
see Abdou and Pratt [1979].
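
A template-matching operator of this kind can be sketched as follows. The mask coefficients are an assumption: the base mask below is the three-level pattern as it is commonly given (columns of -1, 0, 1), and the other seven masks are obtained by rotating its outer ring in 45 degree steps. Readers should consult Abdou and Pratt for the exact masks used in the experiments. The names `compass_masks` and `template_match` are invented for the example, and the angle of mask $k$ is taken as $\pi k/4$ with the vertical axis pointing up.

```python
import numpy as np

# Outer ring of a 3x3 mask, counterclockwise from the right-hand position.
RING = [(1, 2), (0, 2), (0, 1), (0, 0), (1, 0), (2, 0), (2, 1), (2, 2)]

def compass_masks(base):
    """Eight masks made by rotating the outer ring of a 3x3 base mask."""
    base = np.asarray(base, dtype=float)
    ring_values = [base[p] for p in RING]
    masks = []
    for k in range(8):
        m = np.zeros((3, 3))
        m[1, 1] = base[1, 1]
        for i, p in enumerate(RING):
            m[p] = ring_values[(i - k) % 8]      # rotate by k * 45 degrees
        masks.append(m)
    return masks

def template_match(image, masks):
    """Strongest mask response and the orientation of the winning mask."""
    img = np.asarray(image, dtype=float)
    rows, cols = img.shape
    responses = np.zeros((len(masks), rows, cols))
    for k, m in enumerate(masks):
        for dr in range(3):
            for dc in range(3):
                # Correlate over interior pixels; borders stay at zero.
                responses[k, 1:-1, 1:-1] += (
                    m[dr, dc] * img[dr:rows - 2 + dr, dc:cols - 2 + dc])
    magnitude = responses.max(axis=0)
    direction = responses.argmax(axis=0) * np.pi / 4
    return magnitude, direction

THREE_LEVEL = [[-1, 0, 1], [-1, 0, 1], [-1, 0, 1]]   # assumed base mask
masks = compass_masks(THREE_LEVEL)

image = np.full((8, 8), 115.0)
image[:, 4:] = 140.0                                 # brighter on the right
magnitude, direction = template_match(image, masks)
```

Because the direction comes from the index of the winning mask, it is quantized to multiples of 45 degrees, unlike the inverse-tangent estimate of the differential operators.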

Detailed Evaluation of One Detector

Before presenting an overall comparison of
these edge detection schemes, we would like to
examine in detail the results of the edge evaluation
on a single scheme in order to discuss the properties
of the edge evaluation measure itself. For
this we have chosen the three-level template matching
operator because it performed consistently better
than any of the other operators in the comparison
experiments described below. Even so, the results
of the edge evaluation measure follow much the same
pattern for the other operators as well.

Figure 5 shows the histogram of edge magnitude
outputs for the three-level operator applied to the
rings image with SNR 50, and Figure 6 shows the
edge magnitude thresholded at nine levels equally
spaced through its range. In Figure 7 are shown
plots of edge evaluation against threshold for
various values of the weighting factor $\gamma$. Figure
8 shows the same data, but plotted instead against
the fraction of pixels which are edge pixels at
each threshold, scaled logarithmically. This is
a better way of presenting the data, since it is
the selection of edge pixels that really matters,
not the threshold directly.

We see that the thinness measure alone (γ=0.0) is of
little use for edge evaluation. It reaches its maximum
value at high thresholds since it rates a set of isolated
edge pixels higher than an even slightly blurred edge. On
the other hand, the continuation measure performs
reasonably well by itself (γ=1.0), reaching a maximum
value at a threshold which selects quite a good set of
edge pixels. (This peak is more pronounced in Figure 8,
since changing the threshold near the maximum produces
only a small change in the population of edge pixels.
Notice that this threshold lies in the valley of the
histogram.) However, a close examination shows that these
edges are several pixels thick. Better results are
achieved with a lower value of γ. For the rest of this
paper we will use γ=0.8, since this value seems to give
the best compromise between continuation and thinness.
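The role of γ can be made concrete with a small sketch. It assumes that the combined measure is the weighted sum E = γC + (1-γ)T of a continuation score C and a thinness score T, which is consistent with γ=0.0 giving thinness alone and γ=1.0 giving continuation alone; the precise definitions of C and T are given earlier in the paper and are not reproduced here. The function names are hypothetical.

```python
def combined_measure(continuation, thinness, gamma):
    """Weighted combination of the two scores (assumed form of E)."""
    return gamma * continuation + (1.0 - gamma) * thinness

def best_threshold(thresholds, continuation, thinness, gamma=0.8):
    """Pick the threshold at which the combined measure is largest.

    `continuation` and `thinness` hold one score per threshold.
    """
    scores = [combined_measure(c, t, gamma)
              for c, t in zip(continuation, thinness)]
    best = max(range(len(scores)), key=scores.__getitem__)
    return thresholds[best], scores[best]
```

With scores that rise in thinness and fall in continuation as the threshold increases, γ=0.0 selects the highest threshold and γ=1.0 the lowest, while intermediate values trade one against the other.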

Table 1 shows the maximum values of the edge evaluation
measure, and the thresholds at which they occur, for the
various values of γ. Figure 9 shows the thresholded edge
magnitude for a range of closely spaced thresholds around
that at which the edge evaluation takes its maximum value
for γ=0.8. Even though we have chosen γ=0.8, two remarks
should be made. Firstly, values of γ from 0.6 to 0.9
produce similar results. Secondly, γ can be adjusted
depending on the relative seriousness of broken edges as
against thickened edges, for a given application. In
general, γ should be fairly high, since filling breaks in
edges is usually a more difficult task than edge thinning.

To show that the peaks in Figure 8 are actually caused by
more or less well formed edges, we give in Figure 10 an
analogous plot, but for the test image of pure noise. The
forms of the curves are quite different, without any
well-defined peaks for the higher values of γ. However,
this graph does reveal a noteworthy property of the edge
evaluation measure: even on an image of pure noise, it is
possible to choose a threshold which gives a moderately
high value of the edge evaluation measure.

At first thought, this may seem to be a defect. However,
on reflection, it is clear that this is an inevitable
characteristic of any such measure based on local good
form. Because of overlap between the neighborhoods to
which the edge operators are applied, an isolated noise
point will produce a correlated set of edge pixels. For
example, the three-level operator will produce a tiny
ring. Even though this ring is highly curved, it is
coherent, and will receive a moderate edge evaluation
score. This evaluation score for isolated noise spots can
be computed analytically as an intrinsic property of each
edge detection scheme. For an image of pure noise, as
used for Figure 10, the evaluation is somewhat lower than
one would expect, apparently because of interference
between adjacent noise pixels. In summary, even in a
noisy image, there will be a certain occurrence of
well-formed edges, either by accidental alignment or as
an artefact of the edge detection scheme used. It is not
the fault of the edge evaluation measure that it reflects
this unavoidable property of the images and detection
schemes used.

Figure 11 illustrates the effects of various levels of
noise on the edge evaluation measure. For clarity, only a
subrange of the data is plotted. Outside this subrange
the plots for the different noise levels tend to
converge. The results show a consistent pattern: the
peaks decrease in step with the signal-to-noise ratio.
Below SNR=10, there are no clear peaks, but the shapes of
the curves show that the presence of edges still has some
effect on the edge evaluation. Although we have not
pursued the matter, this suggests that an edge evaluation
measure might be based on the value E of the measure for
the given image relative to the value measured for the
same detector on an image of pure noise. But such a
relative measure would be useful only for cases of high
noise, when E has no clear peaks, and would not be a good
means of comparing the outputs of different edge
operators.

Away from the peaks, the evaluations for the different
noise levels tend to become similar, while retaining the
same ordering. This shows that a poor threshold leads to
a bad selection of edges, no matter how noisy the
original image.

All the above results are pretty much what one would
expect intuitively from a measure of edge quality. They
thus serve to confirm the validity of the edge evaluation
measure. While the figures show the results for the rings
image, the results for the vertical image are similar,
and if anything, more distinct, since a vertical edge can
be more cleanly digitised, and has not even the slightest
curvature.


Comparison of Detectors

Having established that the measure E behaves well, we
now present a comparison among the ten edge enhancement
operators mentioned above. Every operator was applied to
the test image at the seven different signal-to-noise
ratios, and at each noise level the threshold was
adjusted to maximize E. Figures 12 and 13 show these
maximum values for the differential and template-matching
operators respectively, using the rings image. As
expected, these results show that the three by three
operators are far better than the two by two operators at
detecting edges in the presence of noise. Among the three
by three operators, the three-level operator is clearly
the best, and the compass gradient the worst. The other
four operators produce results of about the same quality.
The same ordering is preserved if we subtract out the
intrinsic response for each operator on pure noise,
although the separations are not so great.

Analogous results for the vertical edge image are shown
in Figures 14 and 15. They are not directly comparable,
especially at the lower SNRs, because the rings image has
a greater density of edges. However, some general remarks
can be made. Firstly, as explained earlier, the vertical
edge gives a higher evaluation. Secondly, the evaluations
of the four three by three differential operators are
more spread out. This can be attributed to relative
orientation biases in the four operators which are
brought out by the vertical edge, but which are cancelled
out over the full range of edge orientations in the rings
image.

Overall, this comparison is in accord with the findings
of Abdou and Pratt. Our results differ from theirs only
when the difference between operators is small by both
measures. They also find the three by three operators
consistently better than the two by two. However, at the
highest signal to noise ratios, the performance of the
two by two operators, according to their figure of merit,
approaches that of the three by three, while our measure
still reveals a considerable difference. This shows that
while the two by two operators can properly locate edges
at low noise levels, they poorly estimate the edge
direction.

By both their measure and ours, the compass gradient is
the worst of the three by three operators. But their
figure of merit, while rating the three-level operator
fairly highly, does not show it as clearly superior in
all cases. These small discrepancies are not at all
surprising, since the two edge evaluation schemes, after
all, measure quite different characteristics of edges.
The general agreement between the two schemes is
encouraging: it serves both to confirm, in large part,
the edge operator ratings of Abdou and Pratt, but from a
different perspective, and also to strengthen our
confidence in the usefulness of the measure E.

CONCLUSIONS


Effects of Preprocessing

Quite a number of techniques have been proposed for
improving the quality of edge detection. We present here
some experiments to demonstrate how the effect of a
selection of these techniques is reflected in the edge
evaluation measure E.

For coping with the effects of noise, two commonly used
techniques are mean and median filtering; that is, each
pixel in the original image is replaced by respectively
the mean or median of the grey levels in a neighborhood
around the pixel. This has the effect of smoothing out
irregularities due to noise. However, as is widely known,
mean filtering has the unfortunate side effect of
blurring or thickening real edges, so median filtering is
often preferred since it does not suffer from this
defect. On the other hand, thickening of edges can
usually be dealt with by non-maximum suppression on the
edge gradient magnitudes; that is, a pixel has its
magnitude set to zero unless it is a local maximum among
those pixels which lie closest to the edge gradient
direction.
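A minimal sketch of the two smoothing filters follows, to make the contrast between them concrete. It is not the code used for the experiments; the function names, the odd neighborhood size and the replication of border pixels are assumptions of the sketch.

```python
import numpy as np

def neighborhood_filter(image, size, reducer):
    """Replace each pixel by `reducer` of its size x size neighborhood.

    `size` is assumed odd; border pixels are replicated outward.
    """
    img = np.asarray(image, dtype=float)
    pad = size // 2
    padded = np.pad(img, pad, mode="edge")
    h, w = img.shape
    # One shifted copy of the image per position in the neighborhood.
    stack = np.stack([padded[r:r + h, c:c + w]
                      for r in range(size) for c in range(size)])
    return reducer(stack, axis=0)

def mean_filter(image, size=3):
    return neighborhood_filter(image, size, np.mean)

def median_filter(image, size=3):
    return neighborhood_filter(image, size, np.median)
```

On an ideal step edge, the three by three median leaves the step unchanged, while the three by three mean spreads it over two extra pixels, which is the blurring referred to above.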

Figure 16 shows the effects of mean and median filtering
on E for different neighborhood sizes. As can be seen,
the edge quality as measured by E is improved by both
mean and median filtering, but if the neighborhood is too
large, mean filtering causes a decrease in edge quality
because of blurring, while median filtering suffers from
no such defect, although it seems less effective with
smaller neighborhoods. This graph also shows the effect
of applying non-maximum suppression to the edge magnitude
output after mean filtering. Even when no averaging is
done (the case of a one by one neighborhood), non-maximum
suppression causes a small improvement in edge quality,
by counteracting the slight blurring introduced by the
edge operator masks. When the averaging is done over a
larger neighborhood, the improvement is more significant,
reaching a maximum when the mean filtering is done over
the same sized neighborhood as the non-maximum
suppression (that is, three by three).
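Non-maximum suppression over a three by three neighborhood, as described earlier in this section, might be sketched as follows. The angle convention (radians, quantized to the four axes 0, 45, 90 and 135 degrees, with rows increasing downward) and the treatment of ties are assumptions of this sketch, not details taken from the experiments.

```python
import numpy as np

def non_maximum_suppression(magnitude, direction):
    """Zero every pixel that is not a local maximum along its gradient axis."""
    mag = np.asarray(magnitude, dtype=float)
    ang = np.asarray(direction, dtype=float)
    # (row step, column step) for the 0, 45, 90 and 135 degree axes.
    offsets = [(0, 1), (1, 1), (1, 0), (1, -1)]
    sector = np.round(ang / (np.pi / 4)).astype(int) % 4
    out = np.zeros_like(mag)
    h, w = mag.shape
    for r in range(h):
        for c in range(w):
            dr, dc = offsets[sector[r, c]]
            keep = True
            for s in (1, -1):  # the two neighbors along the gradient axis
                rr, cc = r + s * dr, c + s * dc
                if 0 <= rr < h and 0 <= cc < w and mag[rr, cc] > mag[r, c]:
                    keep = False
            if keep:
                out[r, c] = mag[r, c]
    return out
```

Applied to a vertical edge whose magnitude profile across the edge is 0, 1, 3, 1, 0, only the central column survives, which is the thinning effect that raises the thinness component of E.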

That the above interpretation of Figure 16 is correct is
shown in Figures 17 and 18, which are analogous, but use
γ=0.6, giving more weight to edge thinness, and γ=1.0,
showing the effect on the continuation measure alone.
These graphs reveal the relative effects of the operators
on edge continuity and edge thinness.

Peleg [1978] has devised a technique for edge improvement
that fills small gaps and straightens out irregularities
in edges. The effect of this process on edge output is
presented in Table 2. While Peleg's technique certainly
improves the form of edges, it has the undesirable
side-effect of thickening them. However, this can be
overcome by applying non-maximum suppression, as is also
shown in Table 2. Again, the relative effects of this
process on edge continuation and edge thinness can be
seen by comparing the results for the different values
of γ.


We have presented a method for evaluating the quality of
edge detector output based solely on the local good form
of the detected edges. It combines two desiderata of a
well-formed edge: good continuation and thinness. This
measure behaves as one would like under the effects of
change of threshold, noise, blurring and other
operations. The comparison experiments show that the
results obtained with this measure are similar to those
obtained with a measure based on the discrepancy of the
detected edge from a known actual edge position. The
small differences between the two methods reveal some
properties of the operators not brought out by the other
approach.

Like other evaluation measures, ours can be used to
compare the effectiveness of different edge detection
schemes and edge improvement schemes on synthetic images.
However, since our measure does not require knowledge of
the true location of edges, it has much wider
application. It can be used to adjust parameters, such as
thresholds, for optimum detection of edges in real images
for which edge location is unknown. The evaluation of the
detected edges can also serve as an indication of the
quality of the original image. Further, the approach of
using local coherence can be extended
to the evaluation of other local feature detectors.


[Table 1. Maximum value of edge evaluation measure, and
threshold at which this occurs, for various values of γ.
Table body not recoverable.]


Rings image, SNR 10, three-level operator.

                            Maximum value of edge
                            evaluation measure

                            γ=0.6     γ=0.8     γ=1.0

SNR 10                      0.771     0.759     0.757

Enhanced                    0.786     0.790     0.806

Enhanced & non-maximum
suppression                 0.841     0.823     0.805

Table 2. Effect of Peleg's edge enhancement procedure
on edge evaluation.


Figure 5. Histogram of edge magnitude obtained by
applying three-level operator to rings image at SNR 50.

Figure 6. Edge pixels extracted by thresholding edge
magnitude (three-level operator) on rings image at
SNR 50. Thresholds, from left to right, top to bottom:
10%, 20%, 30%, 40%, 50%, 60%, 70%, 80% and 90% of range.

Figure 7. Using rings test image at SNR 50 and
three-level operator: edge evaluation against threshold
for various values of parameter γ. (Horizontal axis:
threshold, as a fraction of maximum magnitude.)

Figure 8. Using rings test image at SNR 50 and
three-level operator: edge evaluation against fraction of
edge pixels at each threshold. (Horizontal axis: edge
pixel fraction, log scaled.)


INTRODUCTION

This paper reports on research, primarily at Marr and
Poggio's [9] mechanism level, to design a practical
hardware stereo-matcher and on the interaction this study
has had with our understanding of the problem at the
computational theory and algorithm levels. The
stereo-matching algorithm proposed by Marr and Poggio
[10] and implemented by Grimson and Marr [3] is
consistent with what is presently known about human
stereo vision [2]. Their research has been concerned with
understanding the principles underlying the
stereo-matching problem. Our objective has been to
produce a stereo-matcher that operates reliably at near
real time rates as a tool to facilitate further research
in vision and for possible application in robotics and
stereo-photogrammetry. At present the design and
construction of the camera and convolution modules of
this project have been completed and the design of the
zero-crossing and matching modules is progressing. The
remainder of this section provides a brief description of
the Marr and Poggio stereo algorithm. We then discuss our
general approach and some of the issues that have come up
concerning the design of the individual modules.

There are two distinct approaches to identifying
correspondences between locations in the left and right
images of a stereo pair. The first is to focus on the
local pattern or arrangement of some fine-scale matching
primitive, attempting to determine the mapping between
left and right images which best correlates these
patterns [cf. 5, 6, 1, 4, 8, 12]. The other approach [10]
is to focus on the use of primitives sensitive to image
details at different scales so that matching can be
accomplished first at the coarsest scale and then at
successively finer scales. The density of such primitives
in an image is tied to the scale at which they are
sensitive, which makes it possible to use a simple
matching rule such as:

For each primitive element in the left image, look in a
horizontal interval in the right image about the
corresponding location. If there is only one primitive
there that could match it, accept it as the matching
element. If there are no potential matches in the search
interval, note this fact as evidence that the search
window may be improperly positioned in the right image.
If there is more than one potential match in the search
interval, the match is ambiguous so skip over this point.
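The rule above can be sketched for a single raster line as follows. This is an illustration under stated assumptions, not the matcher's actual logic: primitives are represented as hypothetical (position, sign) pairs, "could match" is taken to mean "has the same sign", and the function and parameter names are invented for the sketch.

```python
def match_line(left, right, half_width, offset=0):
    """Apply the three-way matching rule to one raster line.

    left, right: lists of (x, sign) primitives for the two images.
    half_width:  half the length of the horizontal search interval.
    offset:      assumed displacement of the interval centre.
    Returns (matches, empty, ambiguous); matches maps the left x
    position to the disparity of its unique match.
    """
    matches, empty, ambiguous = {}, [], []
    for x, sign in left:
        centre = x + offset
        candidates = [xr for xr, sr in right
                      if sr == sign and abs(xr - centre) <= half_width]
        if len(candidates) == 1:
            matches[x] = candidates[0] - x   # unique: accept the match
        elif not candidates:
            empty.append(x)                  # window may be mispositioned
        else:
            ambiguous.append(x)              # several candidates: skip
    return matches, empty, ambiguous
```

A high proportion of entries in `empty` would be the evidence, mentioned in the rule, that the search window is improperly positioned.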

The length of the horizontal search interval can be
chosen so that a sufficient percentage of unique matches
is found. The coarse-scale primitives allow the use of a
larger search interval, thus gaining disparity range in
exchange for resolution and the density of matches that
can be obtained. Matching primitives at a finer scale,
which requires a shorter search interval, can then be
accomplished by positioning the search interval using the
rough disparity information obtained from matching the
coarser primitives.
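The coarse-to-fine step can be illustrated with a small sketch in which a fine-scale primitive is matched inside a short interval centred on the coarse disparity estimate. The function names and the representation of primitives as bare positions are assumptions made for this sketch.

```python
def unique_candidate(centre, right_positions, half_width):
    """Return the single right-image position inside the interval, else None."""
    hits = [x for x in right_positions if abs(x - centre) <= half_width]
    return hits[0] if len(hits) == 1 else None

def fine_disparity(x_left, right_fine, coarse_disparity, fine_half_width):
    """Match one fine-scale primitive, centring the short search interval
    on the rough disparity obtained at the coarser scale."""
    hit = unique_candidate(x_left + coarse_disparity, right_fine, fine_half_width)
    return None if hit is None else hit - x_left
```

For densely spaced fine primitives, a short interval at zero offset may contain no candidate and a long one may contain several, whereas the short interval positioned by the coarse estimate contains exactly one.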

Marr and Poggio [10] found that, for this second
approach, peaks in the rate of intensity change in the
image at a given scale of resolution were the appropriate
type of matching primitive. Peaks in the rate of
intensity change along the direction of the local
intensity gradient, or equivalently signed zero-crossings
in the second derivative, correlate with physical
markings and edges on surfaces. However, zero-crossings
in the second derivative of the image are most sensitive
to details at the finest scale of resolution. This
problem can be dealt with by first low pass filtering the
image to attenuate high spatial frequency structure above
the scale of resolution desired. It turns out that
Gaussian smoothing offers the best compromise for
attenuating fine scale structure in the image while
preserving the local geometry at larger scales [7]. Marr
and Hildreth [7] found that with weak restrictions, zeros
in the Laplacian of a Gaussian convolved image,
$\nabla^2 (G * I)$, gave the desired result. Furthermore,
$\nabla^2 (G * I) = (\nabla^2 G) * I$, and $\nabla^2 G$
can be closely approximated by the difference of two
Gaussians having different space constants but normalized
to have the same volume.
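Both statements can be checked numerically in one dimension. The sketch below is illustrative only: it uses a discrete second difference in place of the Laplacian, and the ratio of 1.6 between the two space constants is an assumed value that is not stated in the text. The function names are hypothetical.

```python
import numpy as np

def gaussian_1d(size, sigma):
    """Sampled Gaussian normalized to unit sum ("same volume")."""
    x = np.arange(size) - (size - 1) / 2.0
    g = np.exp(-x ** 2 / (2.0 * sigma ** 2))
    return g / g.sum()

def difference_of_gaussians(size, sigma, ratio=1.6):
    """Narrow Gaussian minus wide Gaussian; the centre is positive."""
    return gaussian_1d(size, sigma) - gaussian_1d(size, ratio * sigma)

SECOND_DIFFERENCE = np.array([1.0, -2.0, 1.0])  # discrete stand-in for the Laplacian

def smooth_then_differentiate(signal, g):
    """Second difference of the smoothed signal."""
    return np.convolve(SECOND_DIFFERENCE, np.convolve(g, signal))

def differentiate_mask_then_convolve(signal, g):
    """Signal convolved with the second difference of the mask."""
    return np.convolve(np.convolve(SECOND_DIFFERENCE, g), signal)
```

Because convolution is associative, the two functions return the same values, so the differentiation can be folded into the mask once, ahead of time. Since both Gaussians have unit sum, the difference mask sums to zero and gives no response on a region of constant intensity.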

DESIGN 

The hardware implementation project grew out of our
effort in spring 1980 to construct supporting hardware
for a VLSI convolution chip being developed at that time
at Hughes Research Labs [11]. A "serpentine memory"
device was required to buffer video output from a TV
camera so that 26 successive image lines could be fed to
the convolver in parallel at video rates. The VLSI
circuit would then convolve the image with a built-in 26
by 26 difference of Gaussian mask, producing a video
convolution signal lagging the TV camera input by 26
lines. In the process of doing this, we found that a
practical Gaussian convolver could be designed using
standard digital TTL technology. This approach took
advantage of the Gaussian's separability, which makes it
possible to decompose the two-dimensional convolution
into two successive one-dimensional Gaussian
convolutions, reducing the number of multiplications
required at each image point from $m^2$ to $2m$, where m
is the mask diameter. Although such a device would be
slower (1 MHz) than the promised speed of the Hughes
chip, it had several advantages. The extensive analog
circuitry necessary to operate the Hughes chip could be
avoided; the all digital design would be easier to debug
and would allow more precision; variable sized Gaussians
would be allowed; and we would have better logistical
control over the project's time frame. Shortly thereafter
we also worked out a rough design for a stereo-matcher
using the same technology.
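The saving from separability can be illustrated in software. The sketch below compares a direct two-dimensional convolution with two successive one-dimensional passes; it is a numerical illustration with hypothetical function names, not a description of the hardware.

```python
import numpy as np

def gaussian_kernel(m, sigma):
    """Symmetric 1-D Gaussian of diameter m, normalized to unit sum."""
    x = np.arange(m) - (m - 1) / 2.0
    k = np.exp(-x ** 2 / (2.0 * sigma ** 2))
    return k / k.sum()

def convolve_rows(image, kernel):
    """1-D convolution along every row, keeping only fully covered positions."""
    return np.array([np.convolve(row, kernel, mode="valid") for row in image])

def separable_gaussian(image, kernel):
    """Horizontal pass followed by vertical pass: 2m multiplies per point."""
    horizontal = convolve_rows(image, kernel)
    return convolve_rows(horizontal.T, kernel).T

def direct_2d(image, kernel):
    """Full m x m mask applied directly: m*m multiplies per point."""
    mask = np.outer(kernel, kernel)  # symmetric, so no flip is needed
    m = len(kernel)
    h, w = image.shape
    out = np.zeros((h - m + 1, w - m + 1))
    for r in range(m):
        for c in range(m):
            out += mask[r, c] * image[r:r + h - m + 1, c:c + w - m + 1]
    return out

def multiplies_per_pixel(m):
    """(direct, separable) multiplication counts for mask diameter m."""
    return m * m, 2 * m
```

For the 26 by 26 mask mentioned above, `multiplies_per_pixel(26)` gives 676 multiplications for the direct form against 52 for the separable form.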

During this period the following guidelines for our
design were formulated:

(1) The stereo-matching system should be raster-based and
pipelined. Our objective here was to eliminate large
"frame buffer" memories for storing intermediate results
during the operation of the stereo-matcher where they
were unnecessary. We already had a design for doing
convolutions that operated directly off of a TV raster
input and it also seemed that the stereo-matching could
be accomplished directly off of parallel raster inputs
from the left and right camera-convolver combinations.
This freedom from frame buffer requirements allows us to
design for image sizes much larger than would otherwise
be convenient to manage: the present system design
handles images with up to 1024 pixels on a line and
places no restriction on the number of lines in the
image.

(2) The system should be modularized in a way that
facilitates development. When we began we were quite
certain about the kind of convolution that was needed for
stereo-matching but we were less sure about the type of
camera appropriate for the task or about many of the
details of the matching hardware. So we wanted the
overall system design to support at least two modes of
operation. The first is the full pipelined mode,
illustrated in figure 1, which begins with two cameras
and produces some form of raster disparity output as well
as any of the intermediate raster signals such as the
convolution values. The second mode, which we call the
memory to memory mode, allows us to use an array stored
in computer memory as the source of input to any of the
modules, and another array as the destination for its
output. This makes it possible to design and debug the
modules individually using a host computer (in our case
the MIT AI Laboratory's Lisp machine) as a sophisticated
test bed. Figure 2 shows the way the convolution module
is interfaced to the Lisp machine's Xbus to accomplish
this. Variations on this mode allow the whole system to
be run, at slower rates, incorporating experimental
designs for one or more of the modules as microcoded
simulations on the Lisp machine.

(3) The system should be developed in a technology that
provides good design aids and allows rapid construction
and easy circuit modification. The best hardware
development support at our lab is for TTL technology on
machine wire wrapped boards. Wire wrapped circuits are
also easy to modify during the debugging process. We feel
that this gives us the best combination of computational
power, development ease, and circuit reliability. We
expect that once experience has been gained with the
prototype system, a transfer to other technologies such
as VLSI will become more practical.

CAMERAS

Our prototype system requires two synchronized raster
inputs from left and right cameras at a pixel rate of
about 1 MHz. The sources can be cameras or images stored
in computer memory. A particular concern for
stereo-matching is that geometric distortion between left
and right images be held to a minimum. For the memory to
memory mode of operation, this can be accomplished by
using the same image digitizer for both images. For real
time applications, however, the geometric distortion of
the cameras should be less than a pixel diameter if
possible. There are means for compensating for fixed
geometric distortions between two cameras, but it would
be desirable to avoid having to use them.

We considered three types of camera for real time
operation: beam deflection devices such as vidicons,
two-dimensional CCD cameras, and one-dimensional CCD
cameras with mirror deflection systems. The vidicon
family of cameras is the best developed, and a wide range
of devices is available. However, they generally have
large geometric distortions, on the order of one percent
or more, and for the most part are designed to operate at
speeds higher than 1 MHz.

The two-dimensional CCD arrays have high geometric
accuracy, but at present they have limited image sizes
and tend to have pixel, line or column flaws. Otherwise,
these devices have good light sensitivity, they are
small, and at least some versions can be run at the
speeds we desire. We expect that in the long run these
will be the most practicable cameras, especially for
robotics applications where more rapid frame rates may be
preferable to very large image dimensions, given the
constraint of a fixed pixel rate.

A one-dimensional solid state CCD array camera in
conjunction with a scanning mirror provides good
geometric precision with large image dimensions and no
pixel flaws. The only shortcoming of this type of camera
for our purposes is its lower light sensitivity. This
problem occurs because the integration time of the
individual sensors is limited to the time spent on a
single line of the image; for a 1024 element array
running at 1 MHz, this is about 1 ms. Two-dimensional
sensors in comparison have integration times limited by
the time taken to scan the whole frame. During the summer
of 1980, we acquired a 256 element linear array camera
constructed by R. Bishop as a thesis project at MIT. When
operated at a 1 MHz pixel rate, this camera had an
integration time of approximately 0.25 ms and performed
reasonably well under studio lighting conditions.
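The integration times quoted above follow directly from the array length and the pixel rate, as the short calculation below shows. The function names are hypothetical, and the square 1024 by 1024 frame used for comparison is an assumed example rather than a device described here.

```python
def line_integration_time(elements_per_line, pixel_rate_hz):
    """Seconds available to a linear-array sensor: one line period."""
    return elements_per_line / pixel_rate_hz

def frame_integration_time(elements_per_line, lines, pixel_rate_hz):
    """Seconds available to a two-dimensional sensor: one frame period."""
    return elements_per_line * lines / pixel_rate_hz

# 1024 elements at 1 MHz: 1.024 ms, the "about 1 ms" above.
t_1024 = line_integration_time(1024, 1e6)
# 256 elements at 1 MHz: 0.256 ms, the "approximately 0.25 ms" above.
t_256 = line_integration_time(256, 1e6)
# Assumed 1024 x 1024 two-dimensional sensor at the same pixel rate.
t_frame = frame_integration_time(1024, 1024, 1e6)
```

At a fixed pixel rate the two-dimensional sensor integrates for as many line periods as there are lines in the frame, which is the source of the sensitivity difference.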

From this we decided that an in house designed 1024 pixel
linear array scanner would best meet our requirements
over the next year or so. The camera we developed (see
figure 3) makes heavy use of off-the-shelf components. A
Reticon evaluation board is used to operate the linear
array and most of the necessary analog circuitry is
provided on that circuit board. A separate analog to
digital converter was designed and a third circuit board
was designed to generate the ramp signal for the mirror
controller as well as the vertical and horizontal syncs.
A controller board and closed-loop galvanometer produced
by General Scanner Inc. are used to move the mirror. The
mirror sweep is specified to be linear to within 0.15
percent of the peak to peak sweep angle and mirror
repeatability is claimed to be within 1 second of arc.

Our preliminary performance observations are that the
camera's geometric precision and repeatability meet our
requirements. The camera's light sensitivity also seems
satisfactory and we hope to be able to operate the
camera-convolver combination at normal office light
levels. We have had difficulty adjusting the balance
controls on the Reticon board for the even and odd pixel
signals, which are output separately by the linear array.
While this problem is not significant when the camera
output is convolved with a difference of Gaussian mask,
we may improve the analog circuit or do digital
compensation to support other uses of the camera. More
quantitative performance measurements are in progress.

The overall simplicity of this design makes it easy for
us to tailor the cameras to changing requirements that
may develop as the overall project evolves. For example,
should it become necessary to vary the relative vertical
alignment between the left and right cameras as the scan
progresses across the image, to compensate for changes in
surface depth, the ramp generator can be modified easily
to accept a dynamic vertical offset correction signal
from the stereo-matcher.

An effort has also been made to standardize the interface
to the convolver so that other camera types can be used
in the future with a minimum of difficulty. The interface
requires (1) a pixel clock signal with a frequency of up
to 1 MHz, (2) vertical sync, (3) horizontal sync, and
(4) 8 bit pixel data. The convolver accepts camera line
lengths from 1 to 1024 pixels.

CONVOLVER 

The convolution module was the central focus of the first
half of our development effort because of the large
computational requirements that arise in its operation.
For digital Gaussian convolution with a 32 by 32 mask
size, a minimum of 32 multiplies are required for each
pixel of the image. To maintain adequate precision, the
first one-dimensional convolution requires 16 8x8
multiplies while the second one-dimensional convolution
requires 16 8x16 multiplies. Using TRW multiplier chips
we are able to achieve a pixel rate of just under 1 MHz.
Higher pixel rates should be obtainable with more
parallel or analog designs such as the Hughes convolver
chip.
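One way to arrive at 16 multiplies for a 32 point one-dimensional pass is to exploit the symmetry of the Gaussian: samples at mirror-image positions share a coefficient, so they can be added before multiplying. The text does not say that the hardware works this way, so the sketch below is an assumption offered only to make the count plausible. It also leaves open how the extra bit produced by each pair sum would be handled with 8 bit multiplier inputs.

```python
import numpy as np

def folded_symmetric_fir(window, half_coeffs):
    """Dot product of a 2n-sample window with a symmetric 2n-tap kernel,
    using n multiplies.

    window:      2n integer samples.
    half_coeffs: the first n integer coefficients; the remaining n are
                 their mirror image.
    """
    window = np.asarray(window, dtype=np.int64)
    half = np.asarray(half_coeffs, dtype=np.int64)
    n = len(half)
    # Add each sample to its mirror-image partner first ...
    pair_sums = window[:n] + window[::-1][:n]
    # ... then one multiply per pair.
    return int(np.dot(pair_sums, half))
```

With n = 16 this evaluates a 32 tap symmetric mask in 16 multiplies per pass, or 32 for the two passes of the separable convolution.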

Another issue is whether to compute a difference of two
Gaussians or the Laplacian of a single Gaussian
convolution. Generally, given Gaussian convolutions with
a given precision, the difference of Gaussian approach
offers a better signal to noise ratio because of the
second-order differences involved in computing the
Laplacian. The design of two copies of the same
convolution circuit would also be easier so the
difference of Gaussian was selected.

The resulting hardware takes 8 bit pixels as
input image data and produces signed 16 bit
numbers as output. The 32x32 Gaussian mask size
allows difference of Gaussian convolution masks
with central positive diameters w ranging from 2 to
about 12 pixels. In memory to memory mode on
the Lisp machine, a 1000x1000 by 8 bit image can
be convolved and the result stored as a 1000x1000
by 16 bit array in about 20 seconds: 1.5 seconds
for convolving and the remainder for paging
between disk storage and memory.

With the completion of the linear array
camera, we have been working with a
camera-to-convolver-to-display mode using a microcode loop
to read the convolver output and write the sign
bits of the convolution to the bit map of the Lisp
machine's black and white monitor. This gives
us a real time display of zero-crossings from the
camera and we have begun to use it to get a better
familiarity with what the world looks like in these
terms. One quickly notices that zero-crossings
occur locally in response to the features with the
largest contrast and they tend to form closed loops
with diameters on the order of w. If there are sharp
intensity edges the zero-crossings follow them
locally and lower contrast or smaller features have
little influence. In the absence of more significant
intensity fluctuations, however, even very small
intensity variations such as are on a sheet of white
bond paper give rise to zero-crossings. The
zero-crossings generated by lower contrast features are
more susceptible to noise such as the variations
due to 120 cycle illumination flicker. This
particular noise source should not be a problem for
our matching scheme because the left and right
cameras will be synchronized and so will see the
same zero-crossing distortions. They may even
contribute to the texture on otherwise uniform
surfaces.
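
As a rough software illustration of the signal path described in this section (not the hardware design itself), the sketch below applies a separable Gaussian twice, takes the difference of the two results, and keeps only the sign bit, which is what the display loop writes to the monitor. The 31-tap kernel (radius 15), the 1.6 ratio between the two Gaussian widths, the border clamping, and the use of floating point instead of 8 and 16 bit fixed point are all assumptions made for the example.

```python
import math

def gaussian_kernel(sigma, radius):
    """Normalized 1-D Gaussian taps on [-radius, radius]."""
    taps = [math.exp(-(i * i) / (2.0 * sigma * sigma))
            for i in range(-radius, radius + 1)]
    total = sum(taps)
    return [t / total for t in taps]

def convolve_rows(image, kernel):
    """Convolve every row with a 1-D kernel, clamping at the borders."""
    radius = len(kernel) // 2
    out = []
    for row in image:
        n = len(row)
        new_row = []
        for x in range(n):
            acc = 0.0
            for k, tap in enumerate(kernel):
                xi = min(max(x + k - radius, 0), n - 1)
                acc += tap * row[xi]
            new_row.append(acc)
        out.append(new_row)
    return out

def transpose(image):
    return [list(col) for col in zip(*image)]

def gaussian_blur(image, sigma, radius):
    """Two one-dimensional passes: rows first, then columns."""
    kernel = gaussian_kernel(sigma, radius)
    horizontal = convolve_rows(image, kernel)
    return transpose(convolve_rows(transpose(horizontal), kernel))

def difference_of_gaussians(image, sigma_small, sigma_large, radius):
    small = gaussian_blur(image, sigma_small, radius)
    large = gaussian_blur(image, sigma_large, radius)
    return [[a - b for a, b in zip(rs, rl)] for rs, rl in zip(small, large)]

def sign_bits(dog):
    """1 where the convolution is non-negative, 0 where it is negative."""
    return [[1 if v >= 0 else 0 for v in row] for row in dog]

# A vertical step edge between columns 15 and 16.
image = [[0] * 16 + [255] * 16 for _ in range(8)]
dog = difference_of_gaussians(image, 2.0, 3.2, 15)
bits = sign_bits(dog)
```

On this step edge the sign bit flips between columns 15 and 16, so the zero-crossing sits on the intensity edge, as observed above for sharp edges.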


MATCHING 


The matching module presently being designed
will have three components: (1) a zero-crossing
coder which detects zero-crossings in the
convolution raster signal and codes the orientation, and
possibly the gradient or curvature, of the
zero-crossing contour at that location; (2) a matcher
that takes the coded zero-crossing rasters from
the left and right cameras and computes the
disparities of unique matches when they occur as well
as information about the absence of any matches
at locations where they were expected; and (3)
some form of statistics checker for filtering good
matches from random incorrect matches which
occur when the matcher's search window is
improperly positioned.

The stereo matching algorithm developed by
Grimson from Marr and Poggio's theory operates
on roughly vertical zero-crossings in difference of
Gaussian filtered images. For each such
zero-crossing point in the left image, a window in the
other image one line high and w in length (where
w is the diameter of the positive part of the DOG
filter) is searched. If there is only one zero-crossing
in the window that matches the zero-crossing from
the left image, it is taken as a candidate match
to be further validated by a subsequent statistical
check. Grimson has made effective use of this
scheme in furthering our understanding of the
stereo matching problem and in demonstrating
that the Marr and Poggio theory is consistent with
the known psychophysical data on stereo vision.


A straightforward hardware design was worked
out for a matcher following the general idea of
Grimson's algorithm. However, in tests of a
software simulation of the design, we found that it
was very sensitive to small vertical misalignments
between the two images and decided to study
the problem further. Grimson had observed this
problem also and solved it by running the
matching algorithm several times with different
vertical offsets between the two images, collecting
the good matches (as determined by the statistics
checking) from each. A major objective of the
hardware implementation, however, is speed, so it
would be preferable to make the matching less
sensitive to small misalignments or to find a way
to correct the misalignments prior to the matcher
to reduce the number of passes required over the
image.

The matcher's sensitivity to small vertical
misalignments has two roots: (1) matching is
carried out pixel by pixel on roughly vertical
zero-crossing segments that are often less than 5
pixels high in the image; and (2) the search
window is along a single line. As a consequence,
if there is a vertical misalignment of n lines
between the left and right images, n points on one
end of each vertical zero-crossing segment in the
left image cannot match with the correct segment
in the right image. This reduces the portion
of zero-crossings finding matches, which reduces
the likelihood that the statistics checking will pass
any of the matches. Worse, in cases where the
misalignment is small enough to allow most
candidate matches through, it becomes possible for
zero-crossing points at the ends of vertical
segments which do not "see" the correct segment in
the other image to match to the next segment over
and not be rejected by the statistics module. This
latter case produces sporadic mismatches with large
disparity errors. One method, used by Grimson,
for reducing this problem is to do the matching
twice, once driven from the left image and
once from the right, and then to take only those
results found going both ways. This works, we
think, because the accidental matches tend to
occur differently in the two cases and so can be
eliminated.

Vertical misalignments between the two
cameras of a stereo system cannot be avoided. This is
because the distances from a point in the field of
view to the left and right cameras can be different,
giving rise to slightly different scales at that
location in the images. This effect is most pronounced
at the image's left and right edges and for larger
vergence angles. Table 1 shows some worst-case
estimates for the magnitude of this type of vertical
misalignment for typical vergence angles and a
1000 line image with a 20 degree field of view. For
example, with a vergence of 5 degrees (a target 11
feet away when the cameras are 1 foot apart) the
expected vertical alignment of the cameras would
be within plus or minus 7 lines or better.
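
The order of magnitude of this effect can be checked with a small geometric sketch. The model below is an assumption made for illustration, not necessarily the calculation behind Table 1: both cameras fixate a point on the center line, the target lies at the fixation range but 10 degrees off axis, image scale is taken as inversely proportional to target distance, and the offset is evaluated 500 lines from the image center.

```python
import math

def fixation_distance(separation, vergence_deg):
    """Distance to the fixation point for a symmetric vergence angle."""
    half = math.radians(vergence_deg) / 2.0
    return (separation / 2.0) / math.tan(half)

def vertical_offset_lines(separation, vergence_deg, azimuth_deg, half_height_lines):
    """Vertical misalignment, in lines, at the top or bottom image edge."""
    r = fixation_distance(separation, vergence_deg)
    az = math.radians(azimuth_deg)
    x, z = r * math.sin(az), r * math.cos(az)
    d_left = math.hypot(x + separation / 2.0, z)    # left camera at -d/2
    d_right = math.hypot(x - separation / 2.0, z)   # right camera at +d/2
    scale_ratio = d_left / d_right
    return half_height_lines * (scale_ratio - 1.0)

distance = fixation_distance(1.0, 5.0)                 # feet, cameras 1 foot apart
offset = vertical_offset_lines(1.0, 5.0, 10.0, 500)    # lines
```

Under these assumptions the fixation distance comes out a little over 11 feet and the offset is several lines, the same order as the figure quoted in the text.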

There are several ways in which vertical offsets
between the left and right images could be
measured locally, so we are examining the possibility
of a device for correcting misalignments at the
zero-crossing detection stage of the system. Such
a device has to be after the convolver if rapid
changes in alignment are to be handled and it
should be before the stereo-matcher. Our
preliminary design would allow some generality in the
type of zero-crossing information obtained as well
as providing offset adjustment over an 8 to 16 line
range.


DISCUSSION 

We are beginning to experiment with the
first part of our hardware system, the
camera-convolver combination, with an emphasis on
learning as much as we can about what this real time
dimension can add to our understanding of the
stereo matching problem. We are particularly
interested in the degree to which the matcher and
statistics modules can be streamlined by taking
advantage of the repetitive nature of the real time
system in conjunction with the image variations
that can be expected between frames due to
motion or illumination effects.

Figure 1. Modules of the hardware stereo-matcher. Heavy arrows indicate pipelined data paths.

Figure 2. The serpentine memory and digital convolver combination on the Lisp machine Xbus. The
Xbus is a 32 bit bus. Each module has its own data address (heavy arrows) and control address
(thin arrows) on this bus. In memory to memory operation, a microcode routine reads 4 bytes of
image data from an array in memory, writes it to the serpentine memory's data address, and then
reads four 16 bit words from the convolver's data address and writes them to an output array in Lisp
machine memory. This sequence is repeated until the entire input array has been scanned. This loop
runs at about 1.5 μsec per pixel when no paging is required.


Table 1

Vertical Misalignment Magnitudes in a 1000 Line Image Due to Different Distances from
Target to Cameras at 10 Degree Azimuth

d is camera separation.

Pl and Pr are distances to target from left and right cameras.

Figure 3. The linear array camera prototype. All circuitry for the camera other than the power supply
is contained within the camera body, shown here uncovered. Digital output is over a 50 foot ribbon
cable.

An Iterative Image Registration Technique
with an Application to Stereo Vision

Pittsburgh, Pennsylvania 15213


Abstract 

Image registration finds a variety of applications in computer
vision. Unfortunately, traditional image registration techniques
tend to be costly. We present a new image registration technique
that makes use of the spatial intensity gradient of the images to
find a good match using a type of Newton-Raphson iteration. Our
technique is faster because it examines far fewer potential
matches between the images than existing techniques.
Furthermore, this registration technique can be generalized to
handle rotation, scaling and shearing. We show how our
technique can be adapted for use in a stereo vision system.


1.  Introduction 

Image  registration  finds  a  variety  of  applications  in  computer 
vision,  such  as  image  matching  for  stereo  vision,  pattern 
recognition,  and  motion  analysis.  Unfortunately,  existing 
techniques for image registration tend to be costly. Moreover,
they generally fail to deal with rotation or other distortions of the
images. 

In  this  paper  we  present  a  new  image  registration  technique  that 
uses  spatial  intensity  gradient  information  to  direct  the  search  for 
the  position  that  yields  the  best  match.  By  taking  more 
information  about  the  images  into  account,  this  technique  is  able 
to find the best match between two images with far fewer
comparisons of images than techniques which examine the
possible positions of registration in some fixed order. Our
technique takes advantage of the fact that in many applications
the two images are already in approximate registration. This
technique can be generalized to deal with arbitrary linear
distortions of the image, including rotation. We then describe a
stereo  vision  system  that  uses  this  registration  technique,  and 
suggest  some  further  avenues  for  research  toward  making 
effective  use  of  this  method  in  stereo  image  understanding. 


2.  The  registration  problem 

The translational image registration problem can be
characterized as follows: We are given functions F(x) and G(x)
which give the respective pixel values at each location x in two
images, where x is a vector. We wish to find the disparity vector h
which minimizes some measure of the difference between F(x + h)
and G(x), for x in some region of interest R. (See figure 1.)


Figure 1: The image registration problem

Typical measures of the difference between F(x + h) and G(x)
are:

* L1 norm = $\sum_{x \in R} |F(x+h) - G(x)|$

* L2 norm = $\left( \sum_{x \in R} [F(x+h) - G(x)]^2 \right)^{1/2}$

* negative of normalized correlation

$$ = \frac{-\sum_{x \in R} F(x+h)\,G(x)}{\left( \sum_{x \in R} F(x+h)^2 \right)^{1/2} \left( \sum_{x \in R} G(x)^2 \right)^{1/2}} $$

We will propose a more general measure of image difference, of
which both the L2 norm and the correlation are special cases. The
L1 norm is chiefly of interest as an inexpensive approximation to
the L2 norm.
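
For concreteness, the three measures listed above can be written out for a one-dimensional image as in the sketch below. Treating F and G as Python callables on integer positions, and the particular test signal, are assumptions made only for the example.

```python
import math

def l1_norm(F, G, h, region):
    return sum(abs(F(x + h) - G(x)) for x in region)

def l2_norm(F, G, h, region):
    return math.sqrt(sum((F(x + h) - G(x)) ** 2 for x in region))

def neg_normalized_correlation(F, G, h, region):
    cross = sum(F(x + h) * G(x) for x in region)
    energy_f = sum(F(x + h) ** 2 for x in region)
    energy_g = sum(G(x) ** 2 for x in region)
    return -cross / (math.sqrt(energy_f) * math.sqrt(energy_g))

def F(x):
    return math.sin(0.3 * x)

def G(x):
    return F(x + 2)        # G is F displaced by a true disparity of 2

region = range(40)
at_truth = (l1_norm(F, G, 2, region),
            l2_norm(F, G, 2, region),
            neg_normalized_correlation(F, G, 2, region))
```

At the true disparity both norms vanish and the negative normalized correlation reaches its minimum of -1; at any other h in this example the norms are positive.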

3. Existing techniques

An obvious technique for registering two images is to calculate
a measure of the difference between the images at all possible
values of the disparity vector h, that is, to exhaustively search the
space of possible values of h. This technique is very time
consuming; if the size of the picture G(x) is NxN, and the region of
possible values of h is of size MxM, then this method requires
O(M^2 N^2) time to compute.

Speedup at the risk of possible failure to find the best h can be
achieved by using a hill-climbing technique. This technique
begins with an initial estimate h_0 of the disparity. To obtain the
next guess from the current guess h_k, one evaluates the
difference function at all points in a small (say, 3x3) neighborhood
of h_k and takes as the next guess h_{k+1} that point which minimizes
the difference function. As with all hill-climbing techniques, this
method suffers from the problem of false peaks: the local optimum
that one attains may not be the global optimum. This technique
operates in O(M^2 N) time on the average, for M and N as above.

Another technique, known as the sequential similarity detection
algorithm (SSDA) [2], only estimates the error for each disparity
vector h. In SSDA, the error function must be a cumulative one
such as the L1 or L2 norm. One stops accumulating the error for
the current h under investigation when it becomes apparent that
the current h is not likely to give the best match. Criteria for
stopping include a fixed threshold such that when the
accumulated error exceeds this threshold one goes on to the next
h, and a variable threshold which increases with the number of
pixels in R whose contribution to the total error has been added.
SSDA leaves unspecified the order in which the h's are examined.

Note that in SSDA if we adopt as our threshold the minimum
error we have found among the h examined so far, we obtain an
algorithm similar to alpha-beta pruning in min-max game trees [7].
Here we take advantage of the fact that in evaluating
$\min_h \sum_x d(x,h)$, where d(x,h) is the contribution of pixel x at
disparity h to the total error, the $\sum_x$ can only increase as we look
at more x's (more pixels).
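
A minimal sketch of this best-so-far variant, combined with exhaustive search over a one-dimensional set of candidate disparities, is given below. The L1 error, the callable signals, and the `evaluated` counter (included only to show the saving) are assumptions of the example rather than part of SSDA as published.

```python
import math

def ssda_search(F, G, region, candidates):
    """Exhaustive search over h with early termination: stop summing the
    error for a candidate as soon as it reaches the best total so far."""
    best_h, best_err = None, float("inf")
    evaluated = 0                      # number of pixel comparisons made
    for h in candidates:
        err = 0.0
        for x in region:
            err += abs(F(x + h) - G(x))
            evaluated += 1
            if err >= best_err:        # cannot beat the current best
                break
        else:                          # loop finished without pruning
            best_h, best_err = h, err
    return best_h, best_err, evaluated

def F(x):
    return math.sin(0.3 * x)

def G(x):
    return F(x + 3)                    # true disparity is 3

best_h, best_err, evaluated = ssda_search(F, G, range(50), range(-5, 6))
```

Because the partial sums never decrease, pruning cannot discard the true minimum; it only avoids finishing sums that are already too large.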

Some  registration  algorithms  employ  a  coarse-fine  search 
strategy.  See  [6]  for  an  example.  One  of  the  techniques 
discussed  above  is  used  to  find  the  best  registration  for  the 
images  at  low  resolution,  and  the  low  resolution  match  is  then 
used  to  constrain  the  region  of  possible  matches  examined  at 
higher resolution. The coarse-fine strategy is adopted implicitly by
some  image  understanding  systems  which  work  with  a  "pyramid" 
of  images  of  the  same  scene  at  various  resolutions. 

It should be noted that some of the techniques mentioned so far
can be combined because they concern orthogonal aspects of the
image registration problem. Hill climbing and exhaustive search
concern only the order in which the algorithm searches for the
best match, and SSDA specifies only the method used to calculate
(an estimate of) the difference function. Thus, for example, one
could use the SSDA technique with either hill climbing or
exhaustive search. In addition, a coarse-fine strategy may be
adopted.

The algorithm we present specifies the order in which to search
the space of possible h's. In particular, our technique starts with
an initial estimate of h, and it uses the spatial intensity gradient at
each point of the image to modify the current estimate of h to
obtain an h which yields a better match. This process is repeated
in a kind of Newton-Raphson iteration. If the iteration converges,
it will do so in O(M^2 log N) steps on the average. This registration
technique can be combined with a coarse-fine strategy, since it
requires an initial estimate of the approximate disparity h.


4.  The  registration  algorithm 

In  this  section  we  first  derive  an  intuitive  solution  to  the  one¬ 
dimensional  registration  problem,  and  then  we  derive  an 
alternative solution which we generalize to multiple dimensions.
We  then  show  how  our  technique  generalizes  to  other  kinds  of 
registration.  We  also  discuss  implementation  and  performance  of 
the  algorithm. 


4.1. One dimensional case

In the one-dimensional registration problem, we wish to find the
horizontal disparity h between two curves F(x) and G(x) = F(x + h).
This is illustrated in figure 2.


Figure  2:  Two  curves  to  be  matched 


Our solution to this problem depends on a linear approximation
to the behavior of F(x) in the neighborhood of x, as do all
subsequent solutions in this paper. In particular, for small h,

$$ F'(x) \approx \frac{F(x+h) - F(x)}{h} = \frac{G(x) - F(x)}{h}, \qquad (1) $$

so that

$$ h \approx \frac{G(x) - F(x)}{F'(x)}. \qquad (2) $$

The success of our algorithm requires h to be small enough that
this approximation is adequate. In section 4.3 we will show how to
extend the range of h's over which this approximation is adequate
by smoothing the images.

The approximation to h given in (2) depends on x. A natural
method for combining the various estimates of h at various values
of x would be to simply average them:

$$ h \approx \sum_x \frac{G(x) - F(x)}{F'(x)} \Big/ \sum_x 1. \qquad (3) $$

We can improve this average by realizing that the linear
approximation in (1) is good where F(x) is nearly linear, and
conversely is worse where |F''(x)| is large. Thus we could weight
the contribution of each term to the average in (3) in inverse
proportion to an estimate of |F''(x)|. One such estimate is

$$ F''(x) \approx \frac{G'(x) - F'(x)}{h}. \qquad (4) $$

Since our estimate is to be used as a weight in an average, we can
drop the constant factor of 1/h in (4), and use as our weighting
function

$$ w(x) = \frac{1}{|G'(x) - F'(x)|}. \qquad (5) $$

This in fact appeals to our intuition: for example, in figure 2, where
the two curves cross, the estimate of h provided by (2) is 0, which
is bad; fortunately, the weight given to this estimate in the average
is small, since the difference between F'(x) and G'(x) at this point
is large. The average with weighting is

$$ h \approx \sum_x \frac{w(x)\,[G(x) - F(x)]}{F'(x)} \Big/ \sum_x w(x), \qquad (6) $$

where w(x) is given by (5).

Having obtained this estimate, we can then move F(x) by our
estimate of h, and repeat this procedure, yielding a type of
Newton-Raphson iteration. Ideally our sequence of estimates of h
will converge to the best h. This iteration is expressed by

$$ h_0 = 0, \qquad h_{k+1} = h_k + \sum_x \frac{w(x)\,[G(x) - F(x+h_k)]}{F'(x+h_k)} \Big/ \sum_x w(x). \qquad (7) $$


4.2. An alternative derivation

The derivation given above does not generalize well to two
dimensions because the two-dimensional linear approximation
occurs in a different form. Moreover, (2) is undefined where
F'(x) = 0, i.e. where the curve is level. Both of these problems can
be corrected by using the linear approximation of equation (1) in
the form

$$ F(x+h) \approx F(x) + hF'(x), \qquad (8) $$

to find the h which minimizes the L2 norm measure of the
difference between the curves:

$$ E = \sum_x [F(x+h) - G(x)]^2. $$

To minimize the error with respect to h, we set

$$ 0 = \frac{\partial E}{\partial h} \approx \frac{\partial}{\partial h} \sum_x [F(x) + hF'(x) - G(x)]^2 = \sum_x 2F'(x)\,[F(x) + hF'(x) - G(x)], $$

from which

$$ h \approx \frac{\sum_x F'(x)\,[G(x) - F(x)]}{\sum_x F'(x)^2}. \qquad (9) $$

This is essentially the same solution that we derived in (6), but with
the weighting function w(x) = F'(x)^2. As we will see, the form of
the linear approximation we have used here generalizes to two or
more dimensions. Moreover, we have avoided the problem of
dividing by 0, since in (9) we will divide by 0 only if F'(x) = 0
everywhere (in which case h really is undefined), whereas in (3)
we will divide by 0 if F'(x) = 0 anywhere.

The iterative form with weighting corresponding to (7) is

$$ h_0 = 0, \qquad h_{k+1} = h_k + \frac{\sum_x w(x)\,F'(x+h_k)\,[G(x) - F(x+h_k)]}{\sum_x w(x)\,F'(x+h_k)^2}, \qquad (10) $$

where w(x) is given by (5).


4.3. Performance

A natural question to ask is under what conditions and how fast
the sequence of h_k's converges to the real h. Consider the case

$$ F(x) = \sin x, $$

$$ G(x) = F(x+h) = \sin(x+h). $$

It can be shown that both versions of the registration algorithm
given above will converge to the correct h for |h| < π, that is, for
initial misregistrations as large as one-half wavelength. This
suggests that we can improve the range of convergence of the
algorithm by suppressing high spatial frequencies in the image,
which can be accomplished by smoothing the image, i.e. by
replacing each pixel of the image by a weighted average of
neighboring pixels. The tradeoff is that smoothing suppresses
small details, and thus makes the match less accurate. If the
smoothing window is much larger than the size of the object that
we are trying to match, the object may be suppressed entirely, and
so no match will be possible.

Since lowpass-filtered images can be sampled at lower
resolution with no loss of information, the above observation
suggests that we adopt a coarse-fine strategy. We can use a
low-resolution smoothed version of the image to obtain an
approximate match. Applying the algorithm to higher resolution
images will refine the match obtained at lower resolution.

While the effect of smoothing is to extend the range of
convergence, the weighting function serves to improve the
accuracy of the approximation, and thus to speed up the
convergence. Without weighting, i.e. with w(x) = 1, the calculated
disparity h_1 of the first iteration of (10) with F(x) = sin x falls off to
zero as the disparity approaches one-half wavelength. However,
with w(x) as in (5), the calculation of disparity is much more
accurate, and only falls off to zero at a disparity very near one-half
wavelength. Thus with w(x) as in (5) convergence is faster for
large disparities.
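
The behavior described in this section can be reproduced with a short sketch of the iteration in (10) for the sine example. The sketch assumes an analytic derivative, uniform samples over one full period, and by default no weighting (w(x) = 1); the weight of (5) could be passed in as the optional `weight` argument, which is a name introduced here for illustration.

```python
import math

def register_1d(F, dF, G, xs, weight=None, iterations=10):
    """Iterate equation (10); returns the final h and the list of estimates."""
    h = 0.0
    history = [h]
    for _ in range(iterations):
        num = 0.0
        den = 0.0
        for x in xs:
            w = 1.0 if weight is None else weight(x)
            slope = dF(x + h)
            num += w * slope * (G(x) - F(x + h))
            den += w * slope * slope
        if den == 0.0:                 # F' is zero everywhere: h undefined
            break
        h += num / den
        history.append(h)
    return h, history

true_h = 1.0
xs = [2.0 * math.pi * i / 200 for i in range(200)]
h, history = register_1d(math.sin, math.cos,
                         lambda x: math.sin(x + true_h), xs)
h_large, _ = register_1d(math.sin, math.cos,
                         lambda x: math.sin(x + 3.0), xs, iterations=12)
```

With w(x) = 1 the first estimate works out to sin(h), which is why it falls toward zero as the true disparity approaches one-half wavelength; later iterations still recover the disparity, but more of them are needed when the initial misregistration is large.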


4.4.  Implementation 

Implementing (10) requires calculating the weighted sums of the
quantities F'G, F'F, and (F')^2 over the region of interest R. We
cannot calculate F'(x) exactly, but for the purposes of this
algorithm, we can estimate it by

$$ F'(x) \approx \frac{F(x + \Delta x) - F(x)}{\Delta x}, $$

and similarly for G'(x), where we choose Δx appropriately small
(e.g. one pixel). Some more sophisticated technique could be
used for estimating the first derivatives, but in general such
techniques are equivalent to first smoothing the function, which
we have proposed doing for other reasons, and then taking the
difference.

4.5. Generalization to multiple dimensions

The one dimensional registration algorithm given above can be
generalized to two or more dimensions. We wish to minimize the
L2 norm measure of error:

$$ E = \sum_{x \in R} [F(x+h) - G(x)]^2, $$

where x and h are n-dimensional row vectors. We make a linear
approximation analogous to that in (8),

$$ F(x+h) \approx F(x) + h \frac{\partial}{\partial x} F(x), $$

where ∂/∂x is the gradient operator with respect to x, as a column
vector:

$$ \frac{\partial}{\partial x} = \left[ \frac{\partial}{\partial x_1} \;\; \frac{\partial}{\partial x_2} \;\; \cdots \;\; \frac{\partial}{\partial x_n} \right]^T. $$

Using this approximation, to minimize E, we set

$$ 0 = \frac{\partial E}{\partial h} \approx \frac{\partial}{\partial h} \sum_x \left[ F(x) + h \frac{\partial F}{\partial x} - G(x) \right]^2 = \sum_x 2 \frac{\partial F}{\partial x} \left[ F(x) + h \frac{\partial F}{\partial x} - G(x) \right], $$

from which

$$ h = \left[ \sum_x \left( \frac{\partial F}{\partial x} \right)^T [G(x) - F(x)] \right] \left[ \sum_x \frac{\partial F}{\partial x} \left( \frac{\partial F}{\partial x} \right)^T \right]^{-1}, $$

which has much the same form as the one dimensional version in
(6).

The discussions above of iteration, weighting, smoothing, and
the coarse-fine technique with respect to the one-dimensional
case apply to the n-dimensional case as well. Calculating our
estimate of h in the two-dimensional case requires accumulating
the weighted sum of five products ((G - F)F_x, (G - F)F_y, F_x^2, F_y^2, and
F_x F_y) over the region R, as opposed to accumulating one product
for correlation or the L2 norm. However, this is more than
compensated for, especially in high-resolution images, by
evaluating these sums at fewer values of h.


4.6. Further generalizations

Our technique can be extended to registration between two
images related not by a simple translation, but by an arbitrary
linear transformation, such as rotation, scaling, and shearing.
Such a relationship is expressed by

$$ G(x) = F(xA + h), $$

where A is a matrix expressing the linear spatial transformation
between F(x) and G(x). The quantity to be minimized in this case
is

$$ E = \sum_x [F(xA + h) - G(x)]^2. $$

To determine the amount ΔA to adjust A and the amount Δh to
adjust h, we use the linear approximation

$$ F(x(A + \Delta A) + (h + \Delta h)) \approx F(xA + h) + (x \Delta A + \Delta h) \frac{\partial}{\partial x} F(x). \qquad (11) $$

When we use this approximation the error expression again
becomes quadratic in the quantities with respect to which it is to
be minimized. Differentiating with respect to these quantities and
setting the results equal to zero yields a set of linear equations to
be solved simultaneously.
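
For the pure translation case of the multi-dimensional generalization, one update of h in two dimensions amounts to accumulating the five products and solving a 2x2 linear system. The sketch below illustrates this under several assumptions: the images are callables F(x, y) and G(x, y), derivatives are forward differences with a small step `delta`, weighting is omitted, and the test pattern and sample grid are invented for the example.

```python
import math

def refine_shift_2d(F, G, points, hx, hy, delta=1e-3):
    """One update of (hx, hy) from the sums of (G-F)Fx, (G-F)Fy,
    Fx^2, Fy^2 and FxFy over the sample points."""
    sxx = sxy = syy = bx = by = 0.0
    for x, y in points:
        f = F(x + hx, y + hy)
        fx = (F(x + hx + delta, y + hy) - f) / delta   # forward difference
        fy = (F(x + hx, y + hy + delta) - f) / delta
        diff = G(x, y) - f
        sxx += fx * fx
        sxy += fx * fy
        syy += fy * fy
        bx += diff * fx
        by += diff * fy
    det = sxx * syy - sxy * sxy
    if det == 0.0:
        raise ValueError("gradient sums are singular; h is undefined")
    # Solve [[sxx, sxy], [sxy, syy]] [dx, dy]^T = [bx, by]^T by Cramer's rule.
    return (hx + (bx * syy - by * sxy) / det,
            hy + (by * sxx - bx * sxy) / det)

def F(x, y):
    return math.sin(x) + math.cos(0.7 * y)

def G(x, y):
    return F(x + 0.3, y - 0.2)         # true disparity is (0.3, -0.2)

n = 24
points = [(2.0 * math.pi * i / n, 2.0 * math.pi * j / (0.7 * n))
          for i in range(n) for j in range(n)]
hx, hy = 0.0, 0.0
for _ in range(10):
    hx, hy = refine_shift_2d(F, G, points, hx, hy)
```

The singular case corresponds to the remark in section 4.2: the system cannot be solved only when the gradient information over the whole region fails to constrain h.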

This  generalization  is  useful  in  applications  such  as  stereo 
vision,  where  the  two  different  views  of  the  object  will  be  different 
views, due to the difference of the viewpoints of the cameras or to
differences in the processing of the two images. If we model this
difference as a linear transformation, we have (ignoring the
registration problem for the moment)

$$ F(x) = \alpha G(x) + \beta, $$

where α may be thought of as a contrast adjustment and β as a
brightness adjustment. Combining this with the general linear
transformation registration problem, we obtain

$$ E = \sum_x [F(xA + h) - (\alpha G(x) + \beta)]^2 $$

as the quantity to minimize with respect to α, β, A, and h. The
minimization of this quantity, using the linear approximation in
equation (11), is straightforward. This is the general form
promised in section 2. If we ignore A, minimizing this quantity is
equivalent to maximizing the correlation coefficient (see, for
example, [3]); if we ignore α and β as well, minimizing this form is
equivalent to minimizing the L2 norm.


5. Application to stereo vision

In this section we show how the generalized registration
algorithm described above can be applied to extracting depth
information from stereo images.

5.1. The stereo problem

The problem of extracting depth information from a stereo pair
has in principle four components: finding objects in the pictures,
matching the objects in the two views, determining the camera
parameters, and determining the distances from the camera to the
objects. Our approach is to combine object matching with solving
for the camera parameters and the distances of the objects by
using a form of the fast registration technique described above.

Techniques for locating objects include an interest operator [6],
zero crossings in bandpass-filtered images [5], and linear features
[1]. One might also use regions found by an image segmentation
program as objects.

Stereo  vision  systems  which  work  with  features  at  the  pixel  level 
can  use  one  of  the  registration  techniques  discussed  above. 
Systems  whose  objects  are  higher-level  features  must  use  some 
difference  measure  and  some  search  technique  suited  to  the 
particular feature being used. Our registration algorithm provides
a stereo vision system with a fast method of doing pixel-level
matching.

Many stereo vision systems concern themselves only with
calculating the distances to the matched objects. One must also
be aware that in any real application of stereo vision the relative
positions of the cameras will not be known with perfect accuracy.
Gennery [4] has shown how to simultaneously solve for the
camera parameters and the distances of objects.

5.2. A mathematical characterization
The notation we use is illustrated in figure 3. Let c be the vector
of camera parameters that describe the orientation and position of
camera 2 with respect to camera 1's coordinate system. These
parameters are azimuth, elevation, pan, tilt, and roll, as defined in
[4]. Let x denote the position of an image in the camera 1 film
plane of an object. Suppose the object is at a distance z from
camera 1. Given the position x in picture 1 and the distance z of the
object, we could directly calculate the position p(x,z) that it must
have occupied in three-space. We express p with respect to
camera 1's coordinate system so that p does not depend on the
orientation of camera 1. The object would appear on camera 2's
film plane at a position q(p,c) that is dependent on the object's
position in three-space p and on the camera parameters c. Let
G(x) be the intensity value of pixel x in picture 1, and let F(q) be the
intensity value of pixel q in picture 2. The goal of a stereo vision
system is to invert the relationship described above and solve for c
and z, given x, F and G.
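
The forward model p(x,z) followed by q(p,c) can be illustrated with a short Python sketch. It is not the camera model of [4]: it assumes a pinhole camera with unit focal length, takes z to be depth along camera 1's optical axis, and reduces c to a translation plus a single pan angle whose sign convention is arbitrary. The function names are hypothetical.

```python
import math

def p_from_image(x, z):
    """Back-project image point x = (u, v) of camera 1 to three-space.

    Assumes unit focal length and that z is depth along the optical axis.
    """
    u, v = x
    return (u * z, v * z, z)

def q_from_point(p, c):
    """Project a three-space point p (camera 1 coordinates) into camera 2.

    c = (tx, ty, tz, pan) is a simplified stand-in for the five
    parameters of [4]: camera 2's position plus one rotation angle.
    """
    tx, ty, tz, pan = c
    # Express the point relative to the centre of camera 2.
    dx, dy, dz = p[0] - tx, p[1] - ty, p[2] - tz
    # Rotate about the vertical (y) axis by the pan angle.
    cos_a, sin_a = math.cos(pan), math.sin(pan)
    xr = cos_a * dx - sin_a * dz
    zr = sin_a * dx + cos_a * dz
    # Perspective division onto the film plane.
    return (xr / zr, dy / zr)

# Camera 2 sits one baseline unit to the right of camera 1, unrotated.
c = (1.0, 0.0, 0.0, 0.0)
p = p_from_image((0.0, 0.0), 6.0)
q = q_from_point(p, c)
```

Under these assumptions an object on camera 1's optical axis at depth 6 appears in camera 2 displaced by one sixth of the focal length, which is the disparity that a stereo system must invert to recover z.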


5.3.  Applying  the  registration  algorithm 
First consider the case where we know the exact camera
parameters c, and we wish to discover the distance z of an object.
Suppose we have an estimate of the distance z. We wish to see
what happens to the quality of our match between F and G as we
vary z by an amount $\Delta z$. The linear approximation that we use
here is

$$F(z + \Delta z) \approx F(z) + \Delta z \frac{\partial F}{\partial z},$$

where

$$\frac{\partial F}{\partial z} = \frac{\partial p}{\partial z}\,\frac{\partial q}{\partial p}\,\frac{\partial F}{\partial q}. \qquad (12)$$

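This equation is due to the chain rule of the gradient operator;
$\partial q/\partial p$ is a matrix of partial derivatives of the components of q
with respect to the components of p, and $\partial F/\partial q$ is the spatial
intensity gradient of the image F(q). To update our estimate of z,
we want to find the $\Delta z$ which satisfies

$$0 = \frac{\partial E}{\partial \Delta z} \approx \frac{\partial}{\partial \Delta z} \sum_x \left[ F + \Delta z \frac{\partial F}{\partial z} - G \right]^2 .$$

Solving for $\Delta z$, we obtain

$$\Delta z = \frac{\sum_x \frac{\partial F}{\partial z}\,[G - F]}{\sum_x \left( \frac{\partial F}{\partial z} \right)^2},$$

where $\partial F/\partial z$ is given by (12).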
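
The depth update can be tried on a one-dimensional toy problem. The sketch below is only an illustration under strong assumptions: the images are a synthetic sine signal, the cameras are parallel with unit focal length and unit baseline (so the disparity is 1/z), and the function names are hypothetical.

```python
import math

def F(u):
    """Intensity of picture 2 at position u (a smooth synthetic signal)."""
    return math.sin(u)

def dF_dq(u):
    """Spatial intensity gradient of picture 2."""
    return math.cos(u)

def q_of(x, z):
    """Position in picture 2 of the point seen at x in picture 1.

    Assumes parallel cameras with unit focal length and unit baseline.
    """
    return x + 1.0 / z

def depth_step(xs, G, z):
    """One update: dz = sum(dF/dz * (G - F)) / sum((dF/dz)**2)."""
    num = 0.0
    den = 0.0
    for x, g in zip(xs, G):
        q = q_of(x, z)
        dF_dz = (-1.0 / (z * z)) * dF_dq(q)   # chain rule: dq/dz * dF/dq
        num += dF_dz * (g - F(q))
        den += dF_dz * dF_dz
    return num / den

z_true = 6.0
xs = [0.1 * k for k in range(31)]             # pixels of one region
G = [F(q_of(x, z_true)) for x in xs]          # picture 1, synthesised

z = 7.0                                       # initial depth estimate
for _ in range(10):
    z += depth_step(xs, G, z)
```

Starting from a depth of 7 baseline units, the estimate settles on the true depth within a few iterations because the initial disparity error is small compared with the scale of the signal.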


On the other hand, suppose we know the distances $z_i$,
$i = 1, 2, \ldots, n$, of each of n objects from camera 1, but we don't know
the exact camera parameters c. We wish to determine the effect
of changing our estimate of the camera parameters by an amount
$\Delta c$. Using the linear approximation

$$F(c + \Delta c) \approx F(c) + \Delta c\,\frac{\partial q}{\partial c}\,\frac{\partial F}{\partial q},$$

we solve the minimization of the error function with respect to $\Delta c$
by setting

$$0 = \frac{\partial}{\partial \Delta c} \sum_i \sum_{x \in R_i} [F(c + \Delta c) - G]^2 ,$$


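obtaining

$$\Delta c^{T} \approx \left[ \sum_i \sum_{x \in R_i} g\,g^{T} \right]^{-1} \sum_i \sum_{x \in R_i} g\,[G - F], \qquad g = \frac{\partial q}{\partial c}\,\frac{\partial F}{\partial q}.$$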


As with the other techniques derived in this paper, weighting and
iteration improve the solutions for $\Delta z$ and $\Delta c$ derived above.


5.4.  An  Implementation 

We have implemented the technique described above in a
system which functions well under human supervision. Our
program is capable of solving for the distances to the objects, the
five camera parameters described above, and a brightness and
contrast parameter for the entire scene, or any subset of these
parameters. As one would expect from the discussion in section
4.3, the algorithm will converge to the correct distances and
camera parameters when the initial estimates of the $z_i$'s and c are
sufficiently accurate that we know the position in the camera 2 film
plane of each object to within a distance on the order of the size of
the object.

A session with this program is illustrated in figures 4 through 10.
The original stereo pair is presented in figure 4. (Readers who can
view stereo pairs cross-eyed will want to hold the pictures upside
down so that each eye receives the correct view.) The camera
parameters were determined independently by hand-selecting
matching points and solving for the parameters using the program
described in [4].


Figures 5 and 6 are bandpass-filtered versions of the pictures in
figure 4. Bandpass-filtered images are preferred to lowpass-filtered
images in finding matches because very low spatial
frequencies tend to be a result of shading differences and carry no
(or misleading) depth information. The two regions enclosed in
rectangles in the left view of figure 5 have been hand-selected and
assigned an initial depth of 7.0 in units of the distance between
cameras. If these were the actual depths, the corresponding
objects would be found in the right view at the positions indicated
in figure 5. After seven depth-adjustment iterations, the program
found the matches shown in figure 6. The distances are d.06 for
object 1 and 5.66 for object 2.

Figures 7 and 8 are bandpass-filtered with a band one octave
higher than figures 5 and 6. Five new points have been hand-selected
in the left view, reflecting the different features which
have become visible in this spatial frequency range. Each has
been assigned an initial depth equal to that found for the
corresponding larger region in figure 6. The predicted position
corresponding to these depths is shown in the right view of figure
7. After five depth-adjustment iterations, the matches shown in
figure 8 were found. The corresponding depths are 5.96 for object
1, 5.98 for object 2, 5.77 for object 3, 5.78 for object 4, and 6.09 for
object 5.


Figures 9 and 10 are bandpass-filtered with a band yet another
octave higher than figures 7 and 8. Again five new points have
been hand-selected in the left view, reflecting the different
features which have become visible in this spatial frequency
range. Each has been assigned an initial depth equal to that
found for the corresponding region in figure 8. The predicted
position corresponding to these depths is shown in the right view
of figure 9. After four depth-adjustment iterations, the matches
shown in figure 10 were found. The corresponding depths are
5.97 for object 1, 5.98 for object 2, 5.80 for object 3, 5.77 for object
4, and 5.98 for object 5.


5.5.  Future  research 

The system that we have implemented at present requires
considerable hand-guidance. The following are the issues we
intend to investigate toward the goal of automating the process.

•  Providing initial depth estimates for objects: one should be able
to use approximate depths obtained from low-resolution images
to provide initial depth estimates for nearby objects visible only
at higher resolutions. This suggests a coarse-fine paradigm not
just for the problem of finding individual matches but for the
problem of extracting depth information as a whole.

•  Constructing a depth map: one could construct a depth map
from depth measurements by some interpolation method, and
refine the depth map with depth measurements obtained from
successively higher-resolution views.

•  Selecting points of interest: the various techniques mentioned
in section 3 should be explored.

•  Tracking sudden depth changes: the sudden depth changes
found at the edges of objects require some set of higher-level
heuristics to keep the matching algorithm on track at object
boundaries.

•  Compensating for the different appearances of objects in the
two views: the general form of the matching algorithm that
allows for arbitrary linear transformations should be useful here.

ABSTRACT

Over the past three years, Lockheed has been
working in navigation of an autonomous aerial
vehicle using passively sensed images. One technique
which has shown promise is bootstrap stereo,
in which the vehicle's position is determined from
the perceived locations of known ground control
points. Successive pairs of known vehicle camera
positions are then used to locate corresponding
image points on the ground, creating new control
points. This paper describes a series of error
simulations which have been performed to investigate
the error propagation as the number of bootstrapping
iterations increases.


INTRODUCTION 

A previous paper [1] postulated an autonomous
navigation system, called the Navigation Expert,
which attempts to approximate the sophistication of
an early barnstorming pilot. This expert would
navigate partly by its simple instruments (altimeter,
airspeed indicator, and attitude gyros), but
mostly by what it could see of the terrain below it.
A major component of the Navigation Expert is a
technique which we called bootstrap stereo.

Given a set of ground control points with
known real-world positions, and given the locations
of the projections of these points onto the image
plane, it is possible to determine the position and
orientation of the camera which collected the image.
Conversely, given the positions and orientations of
two cameras and the locations of corresponding
point-pairs in the two image planes, the real-world
locations of the viewed ground points can be
determined [2]. Combining these two techniques
iteratively produces the basis for bootstrap stereo.
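
The second of these steps, locating a ground point from two known cameras, can be sketched as a ray intersection. The paper does not say which method [2] uses; the Python sketch below assumes the common midpoint method, which returns the point halfway along the shortest segment joining the two rays. All names and the example coordinates are hypothetical.

```python
def sub(a, b):
    return tuple(ai - bi for ai, bi in zip(a, b))

def add(a, b):
    return tuple(ai + bi for ai, bi in zip(a, b))

def scale(a, s):
    return tuple(ai * s for ai in a)

def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))

def intersect_rays(c1, d1, c2, d2):
    """Midpoint of the shortest segment joining two rays.

    c1, c2 are camera centres; d1, d2 are ray directions through the
    matched image points (they need not be unit vectors).
    """
    w = sub(c1, c2)
    a, b, c = dot(d1, d1), dot(d1, d2), dot(d2, d2)
    d, e = dot(d1, w), dot(d2, w)
    denom = a * c - b * b            # zero when the rays are parallel
    s = (b * e - c * d) / denom      # parameter along ray 1
    t = (a * e - b * d) / denom      # parameter along ray 2
    p1 = add(c1, scale(d1, s))
    p2 = add(c2, scale(d2, t))
    return scale(add(p1, p2), 0.5)

# Two camera stations 600 ft apart at 1200 ft, viewing one ground point.
ground = (300.0, 100.0, 0.0)
cam1 = (0.0, 0.0, 1200.0)
cam2 = (600.0, 0.0, 1200.0)
located = intersect_rays(cam1, sub(ground, cam1), cam2, sub(ground, cam2))
```

With error-free rays the two lines meet and the midpoint coincides with the ground point; with perturbed rays it gives a compromise position, which is where positioning error enters the bootstrap loop.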

Figure 1 shows an Autonomous Aerial Vehicle
which has obtained images at three points in its
trajectory. The bootstrap stereo process begins
with a set of landmark points, simplified here to
the two points a and b, whose real-world coordinates
are known. From these, the camera position
and orientation are determined for the image frame
taken at Time 0. Standard image-matching correlation
techniques [3] are then used to locate these
same points in the second, overlapping frame taken
at Time 1. This permits the second camera position
and orientation to be determined.


Because the aircraft will soon be out of
sight of the known landmarks, new landmark points
must be established whenever possible. For this
purpose, "interesting points"--points whose
surrounding information indicates a high likelihood
of their being matchable [c}--are selected in
the first image and matched in the second image.
Successfully matched points have their real-world
locations calculated from the camera position and
orientation data, then join the landmarks list.

In Figure 1, landmarks c and d are located in
this manner at Time 1; these new points are later
used to position the aircraft at Time 2. Similarly,
at Time 2, new landmarks e and f join the
list; old landmarks a and b, which are no
longer in the field-of-view, are dropped from the
landmarks list.

Figure 1 Navigation Using Bootstrap Stereo.

Bootstrap stereo, then, consists of four
components--camera calibration, new landmark
point selection, point matching, and control point
positioning. All of these components are well
established in the photogrammetry and image processing
literature; each is conceded to work
reasonably well. What was not known was how the
errors would accumulate and propagate in an
iterative application such as bootstrap stereo.

This paper describes a series of error simulations
which have been performed to investigate the error
buildup as the number of bootstrapping iterations
increases.

ERROR  SIMULATIONS 

Ideally, the error analysis of a new technique
would be performed on several representative
sequences of real data with known ground truth.
For bootstrap stereo, this would require a
sequence of 50 to 100 images, with each consecutive
pair having an overlap of approximately 75%.
In addition, a set of known ground control points
visible (and recognizable) in the first image of
the sequence is needed to initialize the technique.
It is desirable to have the position and orientation
of the camera known for each image; this
simplifies determining the buildup of camera position
error. Alternatively, having sets of known,
visible, recognizable ground control points in every
third or fourth image would permit calculation of
ground point position errors, which should be equivalent
to camera position errors. Finally, it is
imperative that the spatial distortion which the
camera induces in the image be known, either analytically
or empirically, so that the image pixel
positions can be corrected for this distortion.

In the course of developing bootstrap stereo,
it became obvious that we did not have a data set
which met these requirements, nor could we obtain
one within our available time and financial
resources. Despite this, we needed to document
the buildup of error in the camera and ground-point
positions as the bootstrapping progressed. The
only solution was to program a means for simulating
a flight, thus creating data on which the
pieces of code could operate as they would for
actual bootstrapping.

The ideal manner in which to do this would be
to simulate grey-level imagery of a realistic
3-dimensional surface, as seen from arbitrary viewpoints.
This could be accomplished by creating a
digital model of a region of terrain--complete
with features such as vegetation, roads, and
houses, as well as the reflectance properties of
each part of the model. We could then draw up a
flight path over the simulated terrain and calculate
the set of images that the simulated camera
would take of its simulated world as it moved along
this path. These images could then be fed directly
into the interesting-point program, etc., and we
could easily compare the resulting bootstrapped
positions to the simulated flight path. This
approach, however, was deemed to be impractical to
implement within our available computational and
manpower resources.


We therefore re-examined what it was that we
really needed to simulate. The major sources of
error in bootstrapping come from errors in point
matching between images and the manner in which
these errors perturb the numerical analysis and
projective geometry of the camera model calculations
and the ground point positioning. If we
could reasonably simulate the matched image points,
complete with errors, we would not need simulated
grey-level imagery.

Given this simplification, we proceeded with
our simulation. We first created a digital
terrain model by constructing a planar grid below
the general swath of the flight path, then used a
random number generator to create elevations at
each point of the grid. We devised a flight path
by flying our hypothetical aircraft along a given
vector, taking its position at intervals, and
introducing random perturbations in the position
and orientation of the aircraft (hence of the
camera) at each step along the way.
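
The terrain and flight-path generation just described can be sketched as follows. This is an illustration only: the paper does not give the distribution of the random perturbations, so uniform noise is assumed, and the function names, grid dimensions and amplitudes are hypothetical.

```python
import random

def make_terrain(nx, ny, spacing, base, amplitude, rng):
    """Return {point_id: (x, y, elevation)} for an nx-by-ny planar grid."""
    terrain = {}
    for i in range(nx):
        for j in range(ny):
            elevation = base + rng.uniform(0.0, amplitude)
            terrain[i * ny + j] = (i * spacing, j * spacing, elevation)
    return terrain

def make_flight_path(start, step, n_frames, pos_amp, att_amp, rng):
    """Return (position, attitude) pairs along a perturbed straight course.

    position is (x, y, z); attitude holds small (heading, pitch, roll)
    perturbations about the nominal camera orientation.
    """
    path = []
    for k in range(n_frames):
        position = tuple(
            start[a] + k * step[a] + rng.uniform(-pos_amp, pos_amp)
            for a in range(3)
        )
        attitude = tuple(rng.uniform(-att_amp, att_amp) for _ in range(3))
        path.append((position, attitude))
    return path

rng = random.Random(0)                      # fixed seed for repeatability
terrain = make_terrain(nx=20, ny=5, spacing=100.0, base=0.0,
                       amplitude=50.0, rng=rng)
path = make_flight_path(start=(0.0, 200.0, 1200.0), step=(300.0, 0.0, 0.0),
                        n_frames=10, pos_amp=5.0, att_amp=0.01, rng=rng)
```

Keeping the unperturbed course available alongside the perturbed one is what allows the bootstrapped camera positions to be compared against ground truth later.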

For each camera position, we performed the
necessary projective geometry to determine where
each of the terrain grid points fell in the image;
those which fell outside of the field-of-view of
the camera were discarded. Each image point was
perturbed by a random amount and/or rounded (e.g.
to the nearest pixel or 1/10 pixel) to simulate a
match error. Image points were tagged with the ID
numbers of the grid points which generated them,
so that points in two images could be matched
symbolically. (Figure 2 summarizes the parameters
which define a simulation.)
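
A minimal Python sketch of this image-point synthesis is given below. It assumes a camera pointing straight down with no rotation, a square image, and uniform noise; the simulations in the paper also model camera pitch and attitude perturbations, which are omitted here. The function and parameter names are hypothetical.

```python
import math
import random

def project_points(terrain, camera, fov_deg, image_size, noise, quantum, rng):
    """Project tagged terrain points into a downward-looking camera.

    terrain: {point_id: (x, y, elevation)}; camera: (x, y, z) position.
    Returns {point_id: (col, row)} for points inside the field of view,
    with coordinates measured in pixels from the image centre.
    """
    half = image_size / 2.0
    focal = half / math.tan(math.radians(fov_deg) / 2.0)   # in pixels
    image = {}
    for pid, (x, y, e) in terrain.items():
        height = camera[2] - e
        if height <= 0.0:
            continue                      # point is not below the camera
        col = focal * (x - camera[0]) / height
        row = focal * (y - camera[1]) / height
        if abs(col) > half or abs(row) > half:
            continue                      # outside the field of view
        # Simulate a match error: random perturbation, then rounding.
        col += rng.uniform(-noise, noise)
        row += rng.uniform(-noise, noise)
        col = round(col / quantum) * quantum
        row = round(row / quantum) * quantum
        image[pid] = (col, row)
    return image

terrain = {0: (0.0, 0.0, 0.0), 1: (300.0, 0.0, 0.0), 2: (5000.0, 0.0, 0.0)}
image = project_points(terrain, camera=(0.0, 0.0, 1200.0), fov_deg=90.0,
                       image_size=512, noise=0.0, quantum=0.1,
                       rng=random.Random(1))
```

Because each image point keeps the ID of the grid point that produced it, no grey-level correlation is needed to pair points between frames.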

We then proceeded to run the bootstrapping
programs on this data set. The programs for locating
interesting points and matching them from image
to image were replaced with a single program to
retrieve interesting (i.e. visible) points from the
image point files and match them from file to file
by their ID numbers. The camera position calculation
program and the ground-point positioning
program were used without changes.
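
Matching by ID number amounts to intersecting the sets of point IDs visible in two frames. The sketch below assumes each image point file has been read into a dictionary keyed by grid-point ID; the data layout and names are hypothetical.

```python
def match_by_id(image_a, image_b):
    """Pair up image points in two frames that share a grid-point ID."""
    common = sorted(set(image_a) & set(image_b))
    return [(pid, image_a[pid], image_b[pid]) for pid in common]

frame0 = {3: (10.0, 4.0), 7: (-22.5, 8.1), 9: (40.2, -3.0)}
frame1 = {7: (-60.1, 8.0), 9: (2.3, -3.1), 12: (55.0, 0.4)}
matches = match_by_id(frame0, frame1)
```

Points 3 and 12 each appear in only one frame and so drop out, just as points leaving or entering the field-of-view would in a real image sequence.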

RESULTS 

About 25 different experiments were done
using the simplified simulation described above.
These investigated the effects of various mission
parameters (such as camera field-of-view, pointing
angle with respect to the direction of flight, etc.)
as well as the effects of match accuracy on the
resulting errors in the camera position estimates.

The general conclusions from these simulations
are that the distance traveled before path
estimation error becomes unacceptable can be
increased by:

1) Increasing the camera field-of-view (Fig. 3)

2) Increasing the platform stability (Fig. 4)

3) Increasing the match accuracy (Fig. 5)

4) Using backward-looking imagery (Fig. 6)

The first three of these are fairly obvious.
Increasing the field-of-view increases the distance
which can be traveled between camera shutterings
while still maintaining 75% overlap in the data.
Increasing the platform stability makes wild swings
in the pointing angle of the camera less likely;
such swings decrease the ground coverage overlap,
which in turn increases camera positioning inaccuracy.
Increasing the match accuracy decreases the uncertainty
about camera and ground-point positions,
allowing error to build up more slowly.
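
The link between field-of-view and distance per frame can be made concrete with a small calculation. The sketch assumes a camera pointing straight down over flat terrain, which is simpler than the tilted geometries simulated in the paper; the 1200 ft elevation and 75% overlap are the values quoted for the simulations, and the function names are hypothetical.

```python
import math

def ground_footprint(altitude, fov_deg):
    """Along-track ground coverage of a nadir camera over flat terrain."""
    return 2.0 * altitude * math.tan(math.radians(fov_deg) / 2.0)

def step_between_frames(altitude, fov_deg, overlap):
    """Distance flown between exposures for a given fractional overlap."""
    return (1.0 - overlap) * ground_footprint(altitude, fov_deg)

narrow = step_between_frames(1200.0, 30.0, 0.75)   # about 161 ft
wide = step_between_frames(1200.0, 90.0, 0.75)     # about 600 ft
```

Under these assumptions a 90-degree camera covers nearly four times the distance per frame of a 30-degree camera, so fewer bootstrapping iterations, each of which adds error, are needed to fly a given distance.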

That backward-looking imagery should be superior
to nadir or forward-looking imagery is not
intuitively obvious. To understand this, consider
the two major ways in which errors enter into
camera positioning. If the set of ground data
points and their corresponding image points are
slightly inconsistent, the camera calibrator will
make an error in the camera position and orientation.
On the other hand, if a pair of camera positions
and of matching points within the images are
slightly inconsistent, then the rays from the
camera center through the image points will not
intersect precisely, giving an error in that
ground-point position. Of these two errors, the
ground-point positioning is more sensitive, since
it depends on only two rays, while the camera
position determination uses information from a
larger number of ground-point-to-camera rays;
the redundancy in the multiple observations greatly
helps in reducing the error.

Now consider the geometry involved in the
forward- and backward-looking cases. When the
camera is looking forward, the new points are being
placed far ahead of the camera by means of a very
oblique triangle (Figure 7a), where a small error
in match or camera orientation can cause a large
error in the ground-point position. When the
camera is looking backward, the new points are
being placed almost directly under the camera, by
means of a nearly equilateral triangle (Figure 7b)--
the most favorable geometry for minimizing ground-point
position error. Nadir imagery shares this
favorable geometry, but suffers because of the
small amount of visible terrain. Tipping the camera
forward or backward brings more terrain into the
field-of-view, permitting longer moves between
images. Thus, of the three look orientations,
backward-looking stereo provides the best combination
of conditions to maximize the distance moved
before the errors become unacceptable.
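
The sensitivity of the oblique triangle can be checked numerically. The sketch below works in a vertical plane containing the flight path, perturbs the direction of one ray by a small angle, and measures how far the intersection moves. The camera spacing, altitude, target positions and 0.001 radian error are hypothetical values chosen for illustration, not taken from the simulations.

```python
import math

def intersect_2d(c1, a1, c2, a2):
    """Intersect two rays in a plane, given origins and direction angles."""
    d1 = (math.cos(a1), math.sin(a1))
    d2 = (math.cos(a2), math.sin(a2))
    denom = d1[0] * d2[1] - d1[1] * d2[0]        # zero for parallel rays
    wx, wy = c2[0] - c1[0], c2[1] - c1[1]
    s = (wx * d2[1] - wy * d2[0]) / denom        # distance along ray 1
    return (c1[0] + s * d1[0], c1[1] + s * d1[1])

def position_error(c1, c2, target, angle_error):
    """Ground-point error caused by an angular error in the second ray."""
    a1 = math.atan2(target[1] - c1[1], target[0] - c1[0])
    a2 = math.atan2(target[1] - c2[1], target[0] - c2[0]) + angle_error
    x, y = intersect_2d(c1, a1, c2, a2)
    return math.hypot(x - target[0], y - target[1])

cam1, cam2 = (0.0, 1200.0), (1000.0, 1200.0)     # (along-track, altitude)
near = position_error(cam1, cam2, (500.0, 0.0), 0.001)    # under the cameras
far = position_error(cam1, cam2, (6000.0, 0.0), 0.001)    # far ahead
```

For these values the point placed far ahead moves by well over a hundred feet, while the point beneath the cameras moves by about two feet, consistent with the argument that a long, narrow triangle magnifies angular errors.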

The  obvious  tactical  question  is  how  far  can 
the  bootstrap  technique  fly  before  its  errors 
become  unacceptable.  That,  of  course,  will  depend 
on  the  flight  parameters.  We  ran  one  simulation  in 
which  all  parameters  were  favorably  set;  after 
flying  100,000  ft  (almost  20  miles),  the  position 
was  off  by  about  25  feet,  and  the  error  was  still 
accumulating  slowly.  We  do  not  know  how  far  this 
flight  could  have  gone  before  the  errors  became 
unacceptable,  as  this  is  the  longest  flight  we  have 
simulated. 


It should be mentioned that most of our simulations
did NOT include any use of the instrumentation
on our simulated aircraft. Of course, any
reasonable system which is flown will have attitude
and altitude instruments whose readings at
the times the camera is shuttered will be available
to the processing system. These readings can
provide good initial values for the solution of the
highly nonlinear camera equations and can prevent
bad data points from unduly perturbing the solution.
We have flown one simulation using postulated
instrument readings and constraining the
camera position and orientation solution to lie
near these. This simulation showed a 5-fold
increase in distance traveled before the position
became inaccurate, when compared with a similar
run without the instrumentation and constraints.

CONCLUSIONS 

When an autonomous aerial vehicle must navigate
without using external signals or radiating
energy, a visual navigator is an enticing possibility.
We have proposed a Navigation Expert
capable of emulating the behavior of an early
barnstorming pilot in using terrain imagery. One
tool such a Navigation Expert could use is bootstrap
stereo. This is a technique by which the
vehicle's position is determined from the perceived
positions of known landmarks, then two
known camera positions are used to locate real-world
points which serve as new landmarks.

The components of bootstrap stereo are well
established in the photogrammetry and image processing
literature. We have combined these, with
improvements, into a workable system [1]. Simulation
results on the error buildup in the system
have been encouraging.

Of course, these are still only simulation
results. Until we can obtain calibrated, controlled
imagery with known ground truth on which
to run bootstrapping, then compare its results
with a simulation having the same parameters, it
will be difficult to tell how accurately our simulations
represent the bootstrapping process.

CHARACTERISTICS


TERRAIN: x, y grid points, with random elevations above a level plane

FLIGHT PATH: Straight, level course, with random perturbations in position

CAMERA ORIENTATION: Fixed with respect to the vehicle, with random perturbations in vehicle attitude

SYNTHETIC IMAGERY: Image plane projections of terrain points, with random perturbations in image plane position

PROCESSING: Normal camera modeling and ground point positioning, with symbolic image point matching


PARAMETERS

•  Ground grid spacing

•  Base plane elevation

•  Amplitude of elevation

•  Vehicle elevation

•  Amplitude of position perturbations

•  Pitch angle of camera with respect to flight path

•  Amplitude of attitude perturbations (heading, pitch, and roll)

•  Image plane size (x,y)

•  Field-of-view angle

•  Imagery overlap

•  Amplitude of image plane perturbations

•  Number of iterations performed


Fig. 2 Characteristics and Parameters of the Bootstrap Stereo Simulations

Figure 3 Error as a Function of Field-of-View.

Increasing the camera field-of-view increases
the distance which can be flown before errors
become unacceptable. (In this and subsequent
simulations presented here, elevation = 1200
ft; distance flown was varied to maintain 75%
overlap.)

Figure 4 Error as a Function of Course Stability.

Increasing the platform stability decreases the
error buildup, especially for narrow field-of-view
cameras.

Figure 5 Error as a Function of Match Accuracy.

Increasing the match accuracy decreases the
error buildup.


Figure 6 Error as a Function of Look Angle.

Using backward-looking imagery greatly
increases the distance that can be flown
before errors become unacceptable.

Figure 7 Effect of Forward- and Backward-Looking Camera.

The long, narrow triangle obtained in the forward-looking
case is more sensitive to angular errors.

MODEL-BASED THREE DIMENSIONAL INTERPRETATIONS
OF TWO DIMENSIONAL IMAGES

Abstract

Acronym is a comprehensive domain independent
model-based system for vision and manipulation
related tasks. Many of its sub-modules and representations
have been described elsewhere. Here the
derivation and use of invariants for image feature
prediction is described. We describe how predictions
of image features and their relations are made and
how instructions are generated which tell the interpretation
algorithms how to make use of image feature
measurements to derive three dimensional size and
structural and spatial constraints on the original three-dimensional
models. Some preliminary examples of
Acronym’s interpretations of aerial images are shown.

1.  Introduction. 

At the ARPA IU workshop of May 1980 we
reported on a number of conceptual advances proposed
as additions to the Acronym system. These have
now all been implemented, and further extended in a
number of cases. The major performance advantages
are that now Acronym can discriminate instances of
various classes of models, and extract three dimensional
information from monocular images.

To support these developments we have added a
class and subclass relation representation scheme to
the geometric modeling system ([7], [5]). This is based
on the use of symbolic algebraic constraints. In support
of this a constraint manipulation system which
includes a partial decision procedure on consistency
of sets of non-linear inequalities was formulated and
implemented [5]. A geometric reasoning system which
can deal with underconstrained spatial relations was
developed [6]. A new matcher which could manipulate
the constraint systems was built for interpretation [7].
All of these systems were implemented in a mixture of
Maclisp and a new rule system built for the purpose.

We have thus moved from a purely geometric
representation and qualitative geometric reasoning system
to a system with a combined algebraic and
geometric representation and a geometric reasoning
system which can make precise deductions about partially
specified situations. The geometric and algebraic
aspects of the representation complement each other
during interpretation.

In this paper we deal with the techniques developed
for image feature and feature-relation prediction,
and then give some first examples (February
1981) of the performance of the new incarnation of
Acronym on some images. The low level processes
we currently use provide either little or noisy data.
Nevertheless Acronym makes strong and accurate
deductions about the objects appearing in the images.
We expect even better performance when more accurate
low level descriptive processes become available.


2. Prediction.

In the Acronym system generic object classes
and specific objects are represented by volumetric
models based on generalised cones along with a partial
order on sets of non-linear algebraic inequalities
relating model parameters. Image features and relations
between them which are invariant over variations
in the models and camera parameters are identified
by a geometric reasoning system. Such predictions
are combined first to give guidance to low level image
description processes, then to provide coarse filters on
image features which are to be matched to local predictions.
Predictions also contain instructions on how
to use noisy measurements from identified image features
to construct algebraic constraints on the original
three dimensional models. Local matches are combined
subject both to consistently meeting predicted image
feature relations, and to the formation of consistent
sets of algebraic constraints derived from the image.
The result is a three dimensional interpretation of the
image.

This section describes some of the invariants that
are identified by the reasoning system, and gives examples
of how the back constraints are set up giving
three dimensional information about the instances of
the models which appear in images.

2.1  Constraints 

To illuminate the discussion in succeeding subsections
we briefly describe the uses and capabilities
of Acronym’s constraint mechanism and the allowed
structure of constraints themselves.

Acronym’s three-dimensional models are represented
by units and slots (e.g. Bobrow and Winograd
[8]). Any slot which admits numeric fillers also admits
quantifiers (predeclared variable names) and expressions
over quantifiers using the operators +, −, ×,
/ and √.

Constraints can be put on quantifiers. They
take the form of inequalities between expressions as
defined above, along with the possibility of including
max and min (on the left and right of ≤, respectively).
Equality can be encoded as two inequalities. For instance
suppose a cylinder is represented as a generalised
cone whose straight spine has its length defined
by the quantifier CYL_LENGTH and whose cross section
is a circle with radius CYL_RADIUS. Then the class of
all cylinders of volume 5 (in some units) can be represented
by the two constraints:

5 ≥ CYL_LENGTH × CYL_RADIUS × CYL_RADIUS × π
5 ≤ CYL_LENGTH × CYL_RADIUS × CYL_RADIUS × π

The Acronym constraint manipulation system (cms),
described in detail in [5], operates on sets of constraints.
A set of constraints (implicitly conjunctive) defines a
subset of n-dimensional space, where n is the
number  of  quantifiers  mentioned  in  the  constraint  set, 
which  is  the  set  of  points  for  which  all  constraints  are 
true.  This  is  called  the  satisfying  set,  and  is  empty  if 
the  constraints  are  inconsistent.  The  cms  is  used  for 
three  tasks  related  to  this  constraint  set. 

1. Given a set of constraints, partially decide whether
their satisfying set is empty. The outcomes are
“empty” or “I don’t know”.

2. Find numeric (or ±∞) upper and lower bounds on
an expression in quantifiers over the satisfying set of
a constraint set. This uses procedures called SUP and
INF.

3. (A generalisation of 2.) For an expression E and
a set of quantifiers V find expressions L and H in V
such that L ≤ E ≤ H identically over the satisfying
set of the constraint set.

In  2  and  3  the  expressions  being  bounded  can 
include  trigonometric  functions  such  as  sin,  cos  and 
arcsin.  The  cms  we  have  implemented  in  Acronym 
is  a  non-linear  generalisation  of  the  linear  SUP-INF 
method  described  by  Bledsoe  and  Shostak  [9].  It 
behaves identically to that described by the latter for
purely  linear  sets  of  constraints  and  linear  expressions. 
In  addition  it  can  often  produce  good  bounds  (numeric 
and  expressions)  on  highly  non-linear  expressions  in 
the  presence  of  many  non-linear  constraints. 
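
As an illustration only, the following Python sketch shows how numeric bounds on an expression (task 2) can yield the partial decision of task 1 for the cylinder example above. It uses plain interval arithmetic over a box of quantifier ranges, which is a simplification and not Acronym's SUP-INF procedure. The function names and the numeric ranges are assumptions made for this example.

```python
import math


def interval_mul(a, b):
    """Product of two closed intervals given as (low, high) tuples."""
    products = [x * y for x in a for y in b]
    return (min(products), max(products))


def volume_bounds(length, radius):
    """Numeric (inf, sup) of length * radius * radius * pi over a box.

    Treating the two radius factors as independent gives an outer bound;
    it is exact when the radius interval is non-negative.
    """
    radius_squared = interval_mul(radius, radius)
    length_radius_squared = interval_mul(length, radius_squared)
    return interval_mul(length_radius_squared, (math.pi, math.pi))


def volume_5_status(length, radius):
    """Partial decision for the pair of constraints 5 >= V and 5 <= V."""
    low, high = volume_bounds(length, radius)
    # If 5 lies outside [low, high] no point of the box satisfies both
    # constraints, so the satisfying set is certainly empty.
    return "I don't know" if low <= 5 <= high else "empty"


# Hypothetical quantifier ranges (length, radius) in arbitrary units.
print(volume_5_status((2.0, 4.0), (0.5, 1.0)))
print(volume_5_status((2.0, 4.0), (2.0, 3.0)))
```

The answer is deliberately one-sided: bounds that exclude 5 prove emptiness, while bounds that include 5 prove nothing, which mirrors the two outcomes listed for task 1.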

2.2 Shape prediction.

We  predict  shapes  as  ribbons  (the  two  dimen¬ 
sional  analogue  of  three  dimensional  generalised  cones) 
and  ellipses.  These  are  also  the  features  which  are 
found  by  the  low  level  descriptive  process  we  are  tem¬ 
porarily  using  in  Acronym. 

Ribbons  are  a  good  way  of  describing  the  images 
generated by generalised cones. Consider a ribbon
which corresponds to the image of the swept surface
of  a  generalised  cone.  For  straight  spines,  the  projec¬ 
tion  of  the  cone  spine  into  the  image  would  closely  cor¬ 
respond  to  the  spine  of  the  ribbon.  Thus  a  good  ap¬ 
proximation  to  the  observed  angle  between  the  spines 
of  two  generalised  cones  is  the  angle  between  the  spines 
of  the  two  ribbons  in  the  image  corresponding  to  their 
swept  surfaces.  We  do  not  have  a  quantitative  theory 
of  these  correspondences.  Ellipses  are  a  good  way  of 
describing  the  shapes  generated  by  the  ends  of  general¬ 
ised  cones.  The  perspective  projections  of  ends  of  cones 
with  circular  cross-sections  are  exactly  ellipses. 

Shape  prediction  involves  deciding  what  shapes 
will  be  visible,  predicting  ranges  for  shape  parameters 
(to  be  used  as  a  coarse  filter  during  interpretation  and 
also to guide the low level descriptive processes) and
deriving  instructions  about  how  to  locally  invert  the 
perspective  transform  and  hence  use  image  measure¬ 
ments  to  generate  constraints  on  the  original  three 
dimensional  models. 

To  predict  the  shapes  generated  by  a  single 
generalised  cone,  we  do  not  explicitly  predict  all  pos¬ 
sible  qualitatively  different  viewpoints.  Rather  we 
predict  what  shapes  may  appear  in  the  image,  and 
associate  with  them  methods  to  compute  constraints 
on the model that are implied by their individual
appearance in the image. For example identification of
the image of the swept surface of a right circular cylinder
constrains the relative orientation of the cylinder to the
camera  (we  call  these  back  constraints).  Identification 
of  an  end  face  of  the  cylinder  provides  a  different  set  of 
constraints.  If  both  the  swept  surface  and  an  end  face 
are  identified  then  both  sets  of  constraints  apply.  We 
also  predict  specific  relations  between  shapes  that  will 
be  true  if  they  are  both  observed  correctly.  For  more 
complex  cones,  the  payoff  is  even  greater  for  predict¬ 
ing  individual  shapes  rather  than  exhaustive  analysis 
of  which  shapes  can  appear  together. 

At  other  times  during  prediction  invariant  cases 
of  obscuration  are  noticed.  For  instance  it  may  be 
noticed  that  one  cone  abuts  another  so  that  its  end 
face  will  never  be  visible.  The  consequences  of  such 
realizations  are  propagated  through  the  predictions. 

Prediction  of  shapes  proceeds  in  five  phases. 
First, all the contours on a generalized cone which
could give rise to image shapes are identified by a set
of special purpose rules. These include occluding contours
and contours due purely to internal cone faces.
Thus for instance a right square cylinder will generate
contours for the end faces, the swept faces, and contours
generated by the swept edges at diagonally opposite
vertices of the square cross section. The contours are
generated  independently  of  camera  orientation,  and  in 
terms  of  object  dimensions  rather  than  image  quan¬ 
tities. 

The orientation of the generalised cone relative
to the camera is then examined (this is done by the
geometric reasoning system, see [6], [5]) to decide which
contours will be visible and how their image shapes
will be distorted over the range of variations in the
model parameters which appear in the orientation
expressions.

The  third  phase  predicts  relations  between  con¬ 
tours  of  a  single  generalised  cone  (see  section  2.3). 

The actual shapes are then predicted. The expected
values for shape parameters in the image are
estimated as closed intervals (see below). Finally the
back  constraints  which  will  be  instantiated  during  in¬ 
terpretation  are  constructed. 

2.2.1 Back constraints. Suppose that we wish
to predict the length of an image feature which is
generated by something of length l lying in a plane
parallel to the camera image plane, at distance d from
the camera. Furthermore suppose the camera has a focal
ratio of f. Then the length of the image feature is
given by p = (l × f)/d. Any or all of l, f and d may be
expressions in quantifiers, rather than numbers. Using
the cms we can obtain bounds on the above expression
for image feature length, giving that it will lie in some
range P = [p_l, p_h] where p_l and p_h are either numbers
or ±∞. For more complex geometries the expression
for p will be more complex, but the method is the same
(trigonometric functions are usually involved).

Now given an image feature which is
hypothesised to correspond to the prediction, we have
to decide whether it is acceptable on the basis of its
parameters. The low level descriptive processes are
noisy and provide an error interval, rather than an
exact measurement, for image parameters. Suppose
the interval is M = [m_l, m_h] for a feature parameter
predicted with expression p. Then the parameter is
acceptable if P ∩ M is non-empty. This is the coarse
filtering used during initial hypothesis of image feature
to feature prediction matches.
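
The coarse filter can be sketched in a few lines of Python. This is only a minimal illustration for the simple case p = (l × f)/d with positive interval bounds on l, f and d; the function names and all numeric values are hypothetical and are not taken from Acronym.

```python
def predicted_length_range(l, f, d):
    """Range P of p = (l * f) / d when l, f and d are positive (low, high) intervals."""
    return (l[0] * f[0] / d[1], l[1] * f[1] / d[0])


def passes_coarse_filter(predicted, measured):
    """True when the closed intervals P and M have a non-empty intersection."""
    return max(predicted[0], measured[0]) <= min(predicted[1], measured[1])


# Hypothetical model: feature length 50..70, focal ratio exactly 20,
# distance from the camera 1000..12000 (same length units as l).
P = predicted_length_range((50.0, 70.0), (20.0, 20.0), (1000.0, 12000.0))

print(passes_coarse_filter(P, (0.3, 0.5)))   # measured interval overlapping P
print(passes_coarse_filter(P, (2.0, 2.5)))   # measured interval entirely above P
```

Because P is computed once at prediction time, the test applied to each candidate image feature is a pair of comparisons, which is what makes it suitable as a cheap first filter.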

But note also that the true value of p for the
particular instance of the model which is being imaged
must lie in the range M. Thus we can add the constraints:

m_l ≤ (l × f)/d
m_h ≥ (l × f)/d

to the instance of the model being hypothesised, where
l, f and d are numbers or expressions in quantifiers.
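
To make the effect of these two constraints concrete, the sketch below solves them for d in the special case where l is a positive interval, f is a known number and d already has prior bounds. It is a hand-worked special case, not the general cms, and the function name and numbers are assumptions chosen for illustration.

```python
import math


def back_constrain_distance(l, f, measured, d_prior):
    """Tighten bounds on d from m_l <= (l * f) / d <= m_h.

    l, measured and d_prior are (low, high) intervals with l and d positive;
    f is a positive number. Returns the tightened interval for d, or None
    when the constraints are inconsistent with the prior bounds.
    """
    m_low, m_high = measured
    # p <= m_h  implies  d >= l * f / m_h, weakest when l is smallest.
    d_low = l[0] * f / m_high
    # p >= m_l  implies  d <= l * f / m_l, weakest when l is largest.
    d_high = l[1] * f / m_low if m_low > 0 else math.inf
    low = max(d_prior[0], d_low)
    high = min(d_prior[1], d_high)
    return (low, high) if low <= high else None


# Hypothetical numbers: length 55..65, focal ratio 20, measurement 0.4..0.6,
# distance initially only known to lie in 1000..12000.
print(back_constrain_distance((55.0, 65.0), 20.0, (0.4, 0.6), (1000.0, 12000.0)))
print(back_constrain_distance((55.0, 65.0), 20.0, (0.4, 0.6), (5000.0, 12000.0)))
```

A return value of None corresponds to the case where the hypothesised match must be rejected because its back constraints contradict what is already known about the model instance.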

2.2.2 Trigonometric back constraints. When
the expression p involves trigonometric functions the
above method of generating back constraints will not
work. It would generate constraints involving trigonometric
functions, which our cms can not handle.

One  approach  to  this  problem  is  to  bound  ex¬ 
pression  p  above  and  below  by  expressions  involving 
no  quantifiers  contained  in  arguments  to  trigonometric 
functions,  and  then  use  these  expressions  in  setting  up 
the back constraints. This has the unfortunate side
effect of losing all information implied by the image
feature about the quantifiers eliminated from the bounds.

A second approach is sometimes applicable. If
a trigonometric function has as its argument e, an
expression, and if the cms determines that e is bounded
to lie within a region of the function's domain where it
is strictly monotonic and hence invertible, then specific
back constraints on e can be computed at interpretation
time (as distinct from during prediction). We
illustrate with an example. A cylinder with length
CYL_LENGTH is sitting upright on a table. A camera
with unknown but constrained pan and tilt (the latter
is constrained to lie in the interval [π/12, π/6]) is looking
across from the side of the table, and it is elevated
above table top height. The geometric details and
numeric constants are not important here. Suffice it to
say that the geometric reasoning system deduces that
the pan of the camera is irrelevant to the prediction
of the length of the ribbon corresponding to the swept
surface of the cylinder. It predicts that the length of
the ribbon in the image will in fact be:

(−2.42 × CYL_LENGTH × cos(−TILT)) / CYLINDER.CAMZ

where 2.42 is the focal ratio of the camera and
CYLINDER.CAMZ is an internal quantifier generated by
the prediction module.

Both  of  the  above  approaches  are  used  to 
generate  back  constraints  to  ensure  coverage  of  all  the 
relevant  quantifiers.  They  are: 

m_h ≥ −2.096 × CYL_LENGTH × (1/CYLINDER.CAMZ)
m_l ≤ −2.338 × CYL_LENGTH × (1/CYLINDER.CAMZ)

−TILT ≤ −arccos(sup(−0.413 × m_h
        × CYLINDER.CAMZ × (1/CYL_LENGTH)))
−TILT ≥ −arccos(inf(−0.413 × m_l
        × CYLINDER.CAMZ × (1/CYL_LENGTH)))

The first two are non-trigonometric back constraints
and at interpretation time a simple substitution of the
measured numeric quantities for m_l and m_h is done.
The latter two require further computation at
interpretation time. After the substitution, expressions
must be bounded over the satisfying set of all the
known constraints, and the function arccos applied to
give numeric upper and lower bounds on the quantifier
TILT.
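
The interpretation-time computation for the latter two constraints can be sketched numerically as follows. The sketch assumes that CYLINDER.CAMZ is negative (a sign convention inferred from the leading minus sign in the predicted length, not stated in the text), bounds the argument of arccos with simple interval arithmetic in place of the cms, and uses hypothetical measurements. All function and variable names are illustrative.

```python
import math

FOCAL_RATIO = 2.42  # focal ratio used in the example in the text


def tilt_bounds(measured, cyl_length, camz,
                tilt_prior=(math.pi / 12, math.pi / 6)):
    """Bounds on TILT from a measured ribbon length interval.

    measured, cyl_length and camz are (low, high) intervals; camz is assumed
    negative. Since p = -2.42 * L * cos(TILT) / CAMZ, we have
    cos(TILT) = p * (-CAMZ) / (2.42 * L), and cos is strictly decreasing
    on the prior interval, so it can be inverted there.
    """
    m_low, m_high = measured
    len_low, len_high = cyl_length
    k_low, k_high = -camz[1], -camz[0]          # k = -CAMZ > 0
    cos_sup = m_high * k_high / (FOCAL_RATIO * len_low)
    cos_inf = m_low * k_low / (FOCAL_RATIO * len_high)
    if cos_inf > 1.0 or cos_sup < -1.0:
        return None                             # no angle has such a cosine
    tilt_low = max(tilt_prior[0], math.acos(min(1.0, cos_sup)))
    tilt_high = min(tilt_prior[1], math.acos(max(-1.0, cos_inf)))
    return (tilt_low, tilt_high) if tilt_low <= tilt_high else None


# Hypothetical instance: unit-length cylinder, CAMZ = -10, true tilt pi/8,
# so the true ribbon length is about 0.2236; measured as 0.22..0.227.
print(tilt_bounds((0.22, 0.227), (1.0, 1.0), (-10.0, -10.0)))
```

The larger cosine bound gives the lower tilt bound and vice versa, which is why sup appears in one constraint and inf in the other.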

The techniques described here work for a more
general class of functions than trigonometric functions
(in the current implementation of Acronym we use
it for functions sin, cos and arcsin). The requirement
is that the domain of the function (e.g. the interval
[−π, π] for sin and cos), can be subdivided into a finite
number of intervals over which the function is strictly
monotonic, and hence locally invertible.

2.3  Feature  relation  prediction. 

Image feature (shape) predictions are organised
as the nodes of the prediction graph. The arcs of
the graph predict image-domain relations between the
features. During interpretation pairs of hypothesised
image feature to prediction node matches are coarsely
checked for consistency by attempting to instantiate
the prediction arcs. Some arcs also include back
constraints which the instantiation of the arc implies about
the model. These are treated in exactly the same manner
as those associated with image feature predictions.

Prediction arcs are generated to relate multiple
shapes predicted for a single cone. For instance a
right circular cylinder prediction includes shapes for
the swept surface and perhaps each of the end faces
(depending on whether the camera geometry is known
well enough to determine a priori exactly which faces
will be visible). It can be predicted that a visible end
face will be co-incident at at least one point in the
image with a visible swept surface. (In fact a stronger
prediction can be made: the straight spine of the swept
surface image ribbon can be extended through the center
of mass of the elliptical image of the end face.)

Prediction  arcs  are  also  generated  between 
shapes  associated  with  predictions  for  different  general¬ 
ised  cones.  These  are  actually  of  more  importance  in 
arriving at a consistent global interpretation of collections
of image features as complex objects.

The semantics of the arc types we currently use
are as follows.

2.3.1 Exclusive. If a generalised cone has a
straight spine and during sweeping the cross section is
kept at a constant angle to the spine then at most one
of the cone's end faces can be visible in a single image.
Exclusive arcs relate image features which are mutually
exclusive for this or other reasons. (Note that in this
case instantiations of the two end faces would probably
result in inconsistent back constraints being applied to
the spatial orientation of the original model, so
that eventually the cms would detect an inconsistency.
However checking for the existence of a simple arc at
an early stage is computationally much cheaper than
waiting to invoke the decision procedure.)

2.3.2 Co-linear. If two line segments in three-space
are co-linear then any two-space image of them
will either be a single degenerate point or two co-linear
line segments. As we pointed out in the previous section
the spine of the image shape corresponding to the
swept surface of a cone is usually a good approximation
to the projection of the spine of the cone into the
image. Thus if two cones are known to invariantly have
co-linear spines in three dimensions a co-linear spine
arc between the predictions of their swept surfaces can
be included.

2.3.3 Co-incident. If two cones are physically
co-incident at some point(s) in three-space then for any
camera  geometry,  if  they  are  both  visible  their  projec¬ 
tions  will  be  co-incident  at  some  point(s)  (except  for 
some  cases  of  obscuration).  Failure  to  match  predicted 
co-incident  arcs  turns  out  to  be  the  strongest  pruning 
process  during  image  interpretation. 

2.3.4 Angle. If the angle between the spines
of two generalised cones as viewed from the modeled
camera is invariant over all the rotational variations
in the model (e.g. wing-wing and wing-fuselage angles
when an aircraft is viewed from above; this is because
the only rotational freedom of an aircraft on the ground
is about an axis parallel to the direction of view of an
overhead camera), or if an expression for the observed
angle can be symbolically computed and is sufficiently
simple, then a prediction of the observed angle can be
made. Again the fact that the projections of model spines
correspond to image spines is used here. This arc type
includes  (trigonometric)  back  constraints  which  make 
use  of  the  observed  angle.  Some  such  constraints  con¬ 
strain  relative  spatial  orientations  of  generalised  cones. 
Others  provide  constraints  on  the  orientation  of  the 
plane  of  rotation,  which  generated  the  angle,  relative 
to  the  camera,  and  hence  constraints  on  an  object's 
orientation  relative  to  the  camera. 

2.3.5 Projection. Suppose one cone B is affixed
at one end of its spine to another A somewhere along
its length. The spines need not be co-incident, but the
cones must be. Then the normal projection (we are not
talking about projection in the imaging sense here) of
the spine-end of B onto the spine of A defines a ratio
between distance along the spine of A and the spine
length of A. If the spines of A and B are both observable
then the same projection in the image is invariant
over all possible camera orientations. For example the
ratio of the distance from the rear of the fuselage to
the point of wing attachment, to the length of the
fuselage, is invariant over all viewing angles. Again
we rely on the correspondences between the projection
(other sense here) of a cone spine and the spine of the
ribbon generated by the image of its swept surface.
Projection arcs are only generated for pairs of image
features which have a co-incident arc. They provide
back constraints on the model via the symbolic expression
which describes the modeled spine projection ratio.

2.3.6 Distance. Sometimes symbolic expressions
for the image distance between two image features can
be computed. Distance arcs are only generated for pairs of
image features which also have an angle arc, but no
co-incident arc. Distance arcs generate back constraints
on the original model.

2.3.7 Ribbon-contains. This is a directed arc
type which relates two predicted ribbons, one of which
will two dimensionally contain the other in the image.
For instance, ribbon-contains arcs are built between
the ribbon predicted from the occluding contour of
a generalized cone with rectangular cross-section, and
each of the ribbons generated by the two visible swept
faces.


3.  Some  Image  Interpretations. 

At  the  time  of  writing  the  various  sub-systems 
have  only  been  running  together  for  about  two  weeks. 
The  image  interpretations  reported  here  are  therefore 
of  a  rather  preliminary  nature. 

In the examples to be described here Acronym
was given a generic model of wide-bodied passenger
jet aircraft, along with class specializations to L-1011s
and Boeing-747s. The Boeing-747 class had further
subclass specialisations to Boeing-747B and Boeing-747SP.
The subclasses do not completely partition
their parent classes. The classes are described by
sets of constraints on some 30 quantifiers. Figure 3.1
shows instances of the two major modeled classes of
jet aircraft. These diagrams were drawn by Acronym
from the models given it to carry out the image
interpretations. The constraints for the generic class of wide
bodied jets are given in figure 3.2. Units are meters.

Fig. 3.1 Instances of class models of Boeing-747s and
L-1011s.

The  camera  was  modeled  as  being  between  1000 
and 12000 meters above the ground. Thus there is
little a priori knowledge of the scale of the images. A
specific focal ratio was given: 20. (Similar interpretations
have been carried out with a variable focal ratio,
but  then  the  final  constraints  on  camera  height  and 
focal  ratio  are  coupled,  and  not  as  clear  for  illustra¬ 
tive  purposes  —  no  accuracy  is  lost  due  to  the  non- 
linearities  that  are  introduced  into  the  constraints,  al¬ 
though  both  computation  time  and  garbage  collection 
time  are  increased.) 

The  aircraft  models,  the  camera  model  and  the 
number  of  pixels  in  each  dimension  of  the  image 
(512 × 512 in these examples) were the only pieces of
world knowledge input to Acronym. It has no special
knowledge of aerial scenes: all its rules are about
geometry  and  algebraic  manipulation.  These  were  ap¬ 
plied  to  the  particular  generic  models  it  was  given,  to 
make  predictions  and  then  to  carry  out  interpretations. 


Figures  3.3  through  3.5  show  three  examples  of 
interpretations  carried  out  by  Acronym.  In  each  case 
part  a  is  a  half-tone  of  the  original  grey  level  image. 
The b version is the result of applying the line finder
of Nevatia and Babu [8]. That line finder was designed
to find linear features such as roads and rivers in aerial
photos. Close examination of results on these images
indicates many errors, and undue enlargement in width
of narrow linear features. It also produces many noise
edges in smooth brightness gradients (not visible at
the  resolution  of  the  reproductions  of  these  figures). 
These  edges  are  the  lowest  level  input  to  Acronym. 

An edge linker [4] is directed by the predictions
to look for ribbons and ellipses. In this case there is
very little a priori information about the scale of the
images. The c versions of each figure show the ribbons
fitted to the linked edges when it is searching
for candidate matches for the fuselage and wings of
aircraft. There is even further degradation of image
information at this stage. This is the only data which
the Acronym reasoning system is given to interpret.
Notice that in figure 3.5 almost all the shapes
corresponding to aircraft are lost. Quite a few aircraft in
3.3 are lost also. Besides losing many shapes, the
combination of the edge finder and edge linker conspire to
give very inaccurate image measurements. We assume
all image measurements have a ±30% error, except
that for very small measurements, we assume that pixel
noise swamps even those error estimates. Then the
error is estimated to be inversely proportional to the
measurement with a 2 pixel measurement admitting a
100% error. Thus the data which Acronym really gets
to work with is considerably more fuzzy than indicated
by the c series of figures.
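
One possible reading of this error model is sketched below in Python. The text fixes only two points (±30% in general, 100% at 2 pixels, inversely proportional for small measurements); combining them as the larger of 0.30 and 2/measurement, and clamping the lower bound at zero, are assumptions of this sketch, as are the function names.

```python
def relative_error(measurement_px, base_error=0.30, full_error_px=2.0):
    """Assumed relative error: 30% in general, rising as full_error_px / m
    for small measurements so that a 2 pixel measurement admits 100%."""
    if measurement_px <= 0:
        raise ValueError("measurement must be positive")
    return max(base_error, full_error_px / measurement_px)


def error_interval(measurement_px):
    """Interval M = [m_l, m_h] handed to the coarse filter and back constraints."""
    rel = relative_error(measurement_px)
    low = max(0.0, measurement_px * (1.0 - rel))
    high = measurement_px * (1.0 + rel)
    return (low, high)


for m in (2.0, 4.0, 100.0):
    print(m, error_interval(m))
```

Under this reading the two regimes meet at a measurement of about 6.7 pixels, below which pixel noise dominates the 30% estimate.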

We intend to make use of new and better low
level descriptive processes being developed in our
laboratory by other researchers as soon as they become
robust enough for every day use (e.g. Baker [1] whose
descriptions from stereo will also include surface and
depth information).

Despite  this  very  noisy  descriptive  data 
Acronym  makes  good  interpretations  of  the  images. 
The  d  series  of  figures  show  its  interpretations  with  the 
ribbons  labeled  by  what  part  of  the  model  they  were 
matched  to.  (The  numbers  which  may  be  unreadable 
in 3.3d show the groupings into individual aircraft.)

Acronym first uses the most general set of
constraints, those associated with the generic class of
wide-bodied jets, when carrying out initial prediction
and interpretation. Interpretation adds additional
constraints for each hypothesised aircraft instance. For
example in finding the correspondences in figure 3.4d
constraints were added which eventually constrained the
WING_WIDTH (the width of the wings where they attach
to fuselage) to lie in the range [7, 10.5P77531] compared
to the modeled bounds of [7, 12]. The height of the
camera, modeled to lie in the range [1000, 12000] is
constrained by the interpretation to the range [2199, 3322].

Once a consistent match or partial match to
a geometric model has been found in the context of
some set of constraints (model class), it is easy to check
whether it might also be an instance of a subclass.
We need only add the extra constraints associated
with the subclass and check for consistency with those
already implied by the interpretation using the cms
as described in section 2.1. The aircraft located in
3.4d is consistent with the constraints for an L-1011,
but not for a Boeing-747. Examination of the images
by the author had previously indicated that the aircraft
was an L-1011. The additional symbolic constraints
implied by accepting that the aircraft is in fact an
L-1011 propagate through the entire constraint set.
Although the constraints describing an L-1011 do not
include constraints on camera height, the back
constraints deduced during interpretation relate quantifiers
representing such quantities as length of the wings to
the height (and focal ratio in the more general case).
Thus the height of the camera is further constrained
in 3.4d to lie in the range [2356, 2489]. Recall that
all image measurements were subject to ±30% errors,
and that this estimate has taken all such errors into
account.

Figure 3.3d indicates matches were found for
three airplanes. Examination of the data in 3.3c
indicates that this is the best that could be expected.
Note however that only partial matches were found in
all three cases. For such small ribbons errors were
apparently larger than the generous estimate used. The
fuselage ribbon in the leftmost aircraft (number 1) for
instance fails to pass the coarse filtering stage. Despite
the partial match, this particular aircraft is found to
be consistent with the constraints for an L-1011, but
not consistent with those of a Boeing-747. Again this
is correct.

The other two aircraft identified are even more
interesting. The author had thought from casual
inspection of the grey level image that they were
instances of Boeing-747s. They both gave matches
consistent with the class of wide-bodied jets. As expected
neither was consistent with the extra constraints of an
L-1011. However, although each individual parameter
range from the interpretation constraint sets was
consistent with the individual parameter value or range
for the class of Boeing-747s, neither set of constraints
was consistent with that subclass (the constraints
contain much finer information than just the parameter
ranges, in the same manner as in the example above
where constraints on wing length propagate to
constrain the camera height). On close examination of
the grey-level image it was determined that the aircraft
were not in fact Boeing-747’s. The author used the fact
that they were much smaller than the L-1011 to make
that deduction, but the system made the deduction at
the local level before considering comparisons between
aircraft.
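
The distinction between consistent parameter ranges and a consistent constraint set can be illustrated with a toy Python example. Every number, ratio and name here is hypothetical: two back constraints tie fuselage and wing length to camera height through fixed ratios, and an invented subclass demands a long fuselage together with a short wing.

```python
def intersect(a, b):
    """Intersection of two closed intervals, or None if it is empty."""
    low, high = max(a[0], b[0]), min(a[1], b[1])
    return (low, high) if low <= high else None


# Hypothetical back constraints from one interpretation:
#   fuselage = 0.004 * height, wing = 0.002 * height, height in 9000..12000.
HEIGHT = (9000.0, 12000.0)
FUSELAGE_PER_HEIGHT = 0.004
WING_PER_HEIGHT = 0.002

# Hypothetical subclass constraints.
SUBCLASS_FUSELAGE = (45.0, 70.0)
SUBCLASS_WING = (10.0, 20.0)


def ranges_overlap():
    """Compare each parameter range with the subclass range in isolation."""
    fuselage = (HEIGHT[0] * FUSELAGE_PER_HEIGHT, HEIGHT[1] * FUSELAGE_PER_HEIGHT)
    wing = (HEIGHT[0] * WING_PER_HEIGHT, HEIGHT[1] * WING_PER_HEIGHT)
    return (intersect(fuselage, SUBCLASS_FUSELAGE) is not None
            and intersect(wing, SUBCLASS_WING) is not None)


def jointly_consistent():
    """Propagate both subclass constraints to the shared height quantifier."""
    height_from_fuselage = (SUBCLASS_FUSELAGE[0] / FUSELAGE_PER_HEIGHT,
                            SUBCLASS_FUSELAGE[1] / FUSELAGE_PER_HEIGHT)
    height_from_wing = (SUBCLASS_WING[0] / WING_PER_HEIGHT,
                        SUBCLASS_WING[1] / WING_PER_HEIGHT)
    height = intersect(HEIGHT, height_from_fuselage)
    return height is not None and intersect(height, height_from_wing) is not None


print(ranges_overlap())       # each range on its own admits the subclass
print(jointly_consistent())   # the coupled constraints do not
```

In this toy case the long fuselage requires a height above about 11250 while the short wing requires a height below 10000, so the subclass is rejected even though each parameter range overlaps its subclass range.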

The aircraft (probably Boeing-707’s, but at the
time of writing we haven’t yet got engineering drawings
needed to build an accurate model for Acronym to
check against the images) are in fact too small to be
wide-bodied jets of any type. Since the scale of the
image is unknown a priori this can not be deduced
locally. However it is reflected in the height estimates
derived at the local level: [5400, 8226] interpreting the
L-1011 just as a generic wide-body ([5786, 6170] as an
L-1011), and [9007, 11846] for the rightmost aircraft.
Thus Acronym deduces that either the left aircraft is
a wide-body and the others are not, or the right two
are wide-bodies and the left one is not (it is too big).

Finally note that geometrically there were other
candidates for aircraft in the ribbons of figure 3.3c.
For instance the wing of the aircraft just to the right
of those identified and a ribbon found for its passenger
ramp could be the two wings of an aircraft with
a fuselage missing between them. In fact these two
ribbons were instantiated as an aircraft on the basis of
the coarse filters on the nodes and arcs. However the
set of back constraints they generated were mutually
inconsistent.

Thus we can see from the examples that even
with very poor and noisy data the combined use of
geometry and symbolic algebraic constraints can lead to
accurate image interpretations. The system should be
tested on more accurate low level data to fully evaluate
the power of this approach.


Abstract

Optical  flow  cannot  be  computed  locally,  since  only  one  independent 
measurement  is  available  from  the  image  sequence  at  a  point,  while  the 
flow velocity has two components. A second constraint is needed. A
method for finding the optical flow pattern is presented which assumes
that the apparent velocity of the brightness pattern varies smoothly
almost everywhere in the image. An iterative implementation is shown
which successfully computes the optical flow for a number of synthetic
image sequences. The algorithm is robust in that it can handle image
sequences that are quantized rather coarsely in space and time. It is
also insensitive to quantization of brightness levels and additive noise.
Examples are included where the assumption of smoothness is violated
at  singular  points  or  along  lines  in  the  image. 


1. Introduction

Optical flow is the distribution of apparent velocities of movement
of brightness patterns in an image. Optical flow can arise from
relative motion of objects and the viewer [8, 9]. Consequently, optical
flow can give important information about the spatial arrangement of
the objects viewed and the rate of change of this arrangement [10].
Discontinuities in the optical flow can help in segmenting images into
regions that correspond to different objects [29]. Attempts have been
made to perform such segmentation using differences between successive
image frames [17, 18, 19, 22, 3. 27]. Several papers address the
problem of recovering the motions of objects relative to the viewer
from the optical flow [12, 20, 21, 23, 31]. Some recent papers provide
a clear exposition of this enterprise [32, 33]. The mathematics can
be made rather difficult, by the way, by choosing an inconvenient
coordinate system. In some cases information about the shape of an object
may also be recovered [4, 20, 21].

These papers begin by assuming that the optical flow has already
been determined. Although some reference has been made to schemes
for computing the flow from successive views of a scene [7, 12], the
specifics of a scheme for determining the flow from the image have
not been described. Related work has been done in an attempt to
formulate a model for the short range motion detection processes in
human vision [2, 24]. The pixel recursive equations of Netravali and
Robbins [30], designed for coding motion in television signals, bear
some similarity to the iterative equations developed in this paper.
A recent review [28] of computational techniques for the analysis of
image sequences contains over 150 references.

The optical flow cannot be computed at a point in the image
independently of neighboring points without introducing additional
constraints, because the velocity field at each image point has two
components while the change in image brightness at a point in the image
plane due to motion yields only one constraint. Consider, for example,
a patch of a pattern where brightness¹ varies as a function of one image
coordinate but not the other. Movement of the pattern in one direction
alters the brightness at a particular point, but motion in the other
direction yields no change. Thus components of movement in the latter
direction cannot be determined locally.


2.  Relationship  to  Object  Motion 

The  relationship  between  the  optical  flow  in  the  image  plane 
and  the  velocities  of  objects  in  the  three  dimensional  world  is  not 
necessarily obvious. We perceive motion when a changing picture is
projected onto a stationary screen, for example. Conversely, a moving
object may give rise to a constant brightness pattern. Consider, for
example, a uniform sphere which exhibits shading because its surface
elements are oriented in many different directions. Yet, when it is rotated,
the optical flow is zero at all points in the image, since the shading
does not move with the surface. Also, specular reflections move with
a velocity characteristic of the virtual image, not the surface in which
light is reflected.

For  convenience,  we  tackle  a  particularly  simple  world  where  the 
apparent  velocity  of  brightness  patterns  can  be  directly  identified  with 
the movement of surfaces in the scene.


3. The Restricted Problem Domain


To avoid variations in brightness due to shading effects we initially
assume that the surface being imaged is flat. We further assume that
the incident illumination is uniform across the surface. The brightness
at a point in the image is then proportional to the reflectance of the
surface at the corresponding point on the object. Also, we assume at first
that reflectance varies smoothly and has no spatial discontinuities. This
latter condition assures us that the image brightness is differentiable.
We exclude situations where objects occlude one another, in part, because
discontinuities in reflectance are found at object boundaries. In
two of the experiments discussed later, some of the problems occasioned
by occluding edges are exposed.

In the simple situation described, the motion of the brightness
patterns in the image is determined directly by the motions of
corresponding points on the surface of the object. Computing the
velocities of points on the object is a matter of simple geometry once
the optical flow is known.

¹ In this paper, the term brightness means image irradiance. The brightness
pattern is the distribution of irradiance in the image.


4.  Constraints 


We will derive an equation that relates the change in image
brightness at a point to the motion of the brightness pattern. Let the
image brightness at the point (x, y) in the image plane at time t be
denoted by E(x, y, t). Now consider what happens when the pattern
moves. The brightness of a particular point in the pattern is constant, so
that

\[ \frac{dE}{dt} = 0. \]

Using the chain rule for differentiation we see that,

\[ \frac{\partial E}{\partial x}\frac{dx}{dt} + \frac{\partial E}{\partial y}\frac{dy}{dt} + \frac{\partial E}{\partial t} = 0. \]

(See Appendix A for a more detailed derivation.) If we let

\[ u = \frac{dx}{dt} \quad \text{and} \quad v = \frac{dy}{dt}, \]

then it is easy to see that we have a single linear equation in the two
unknowns u and v,

\[ E_x u + E_y v + E_t = 0, \]

where we have also introduced the additional abbreviations E_x, E_y, and
E_t for the partial derivatives of image brightness with respect to x, y
and t, respectively. The constraint on the local flow velocity expressed
by this equation is illustrated in Figure 1. Writing the equation in still
another way,

\[ (E_x, E_y) \cdot (u, v) = -E_t. \]

Thus the component of the movement in the direction of the brightness
gradient (E_x, E_y) equals

\[ -\frac{E_t}{\sqrt{E_x^2 + E_y^2}}. \]

We cannot, however, determine the component of the movement in the
direction of the iso-brightness contours, at right angles to the brightness
gradient. As a consequence, the flow velocity (u, v) cannot be
computed locally without introducing additional constraints.
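
The constraint above fixes only the flow component along the brightness
gradient. A minimal Python sketch of that computation follows; the function
name normal_flow and the sample derivative values are illustrative
assumptions, not taken from the paper.

```python
import math

def normal_flow(ex, ey, et):
    """Return the flow component along the brightness gradient.

    ex, ey, et are the partial derivatives E_x, E_y, E_t at one point.
    The result is (speed, (nu, nv)), where speed = -E_t / |grad E| and
    (nu, nv) is that speed times the unit gradient direction.
    """
    mag = math.hypot(ex, ey)
    if mag == 0.0:
        # No gradient: the constraint gives no information at this point.
        return 0.0, (0.0, 0.0)
    speed = -et / mag
    return speed, (speed * ex / mag, speed * ey / mag)

# Hypothetical example: a pattern with gradient (3, 4) moving with
# (u, v) = (1, 2) produces E_t = -(3*1 + 4*2) = -11.
speed, (nu, nv) = normal_flow(3.0, 4.0, -11.0)
```

The recovered vector (nu, nv) satisfies the constraint equation but is in
general not the true flow (1, 2); the component along the iso-brightness
contour is simply absent.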


5.  The  Smoothness  Constraint 


If every point of the brightness pattern can move independently,
there is little hope of recovering the velocities. More commonly we
view opaque objects of finite size undergoing rigid motion or deformation.
In this case neighboring points on the objects have similar
velocities and the velocity field of the brightness patterns in the image
varies smoothly almost everywhere. Discontinuities in flow can be expected
where one object occludes another. An algorithm based on a
smoothness constraint is likely to have difficulties with occluding edges
as a result.

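One way to express the additional constraint is to minimize the
square of the magnitude of the gradient of the optical flow velocity:

\[ \left(\frac{\partial u}{\partial x}\right)^2 + \left(\frac{\partial u}{\partial y}\right)^2 \quad \text{and} \quad \left(\frac{\partial v}{\partial x}\right)^2 + \left(\frac{\partial v}{\partial y}\right)^2. \]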


Another measure of the smoothness of the optical flow field is the sum
of the squares of the Laplacians of the two velocity components. The
Laplacians of u and v are defined as

\[ \nabla^2 u = \frac{\partial^2 u}{\partial x^2} + \frac{\partial^2 u}{\partial y^2} \quad \text{and} \quad \nabla^2 v = \frac{\partial^2 v}{\partial x^2} + \frac{\partial^2 v}{\partial y^2}. \]

In simple situations, both Laplacians are zero. If the viewer translates
parallel to a flat object, rotates about a line perpendicular to the surface
or travels orthogonally to the surface, then the second partial derivatives
of both u and v vanish (assuming perspective projection in the
image formation).
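
As a hedged illustration of why these Laplacians vanish, consider idealized
image-plane flows for the three motions, assuming the image plane is
parallel to the flat surface. The constants a, b, \omega and s are
introduced here only for illustration. Each flow is at most linear in the
image coordinates, so all second partial derivatives are zero:

```latex
\begin{aligned}
\text{translation parallel to the surface:} \quad & u = a, \; v = b, \\
\text{rotation about a perpendicular line:} \quad & u = -\omega y, \; v = \omega x, \\
\text{travel orthogonal to the surface:} \quad & u = s x, \; v = s y, \\
\text{in each case:} \quad & \nabla^2 u = \nabla^2 v = 0.
\end{aligned}
```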

In this paper, we will use the square of the magnitude of the
gradient as our smoothness measure. Note that our approach is in
contrast with that of [7], who propose an algorithm that incorporates
additional assumptions such as constant flow velocities within discrete
regions of the image. Their method, based on cluster analysis, cannot
deal with rotating objects, since these give rise to a continuum of flow
velocities.


6. Quantization and Noise

Images may be sampled at intervals on a fixed grid of points.
While tessellations other than the obvious one have certain advantages,
for convenience we will assume that the image is sampled on a
square grid at regular intervals. Let the measured brightness be E_{i,j,k}
at the intersection of the i-th row and j-th column in the k-th image
frame. Ideally, each measurement should be an average over the area
of a picture cell and over the length of the time interval. In the
experiments cited here we have taken samples at discrete points in space and
time instead.

In addition to being quantized in space and time, the measurements
will in practice be quantized in brightness as well. Further, noise
will be apparent in measurements obtained in any real system.


7. Estimating the Partial Derivatives

We must estimate the derivatives of brightness from the discrete
set of image brightness measurements available. It is important that
the estimates of E_x, E_y, and E_t be consistent. That is, they should all
refer to the same point in the image at the same time. While there
are many formulas for approximate differentiation [5, 11] we will use
a set which gives us an estimate of E_x, E_y, E_t at a point in the center
of a cube formed by eight measurements. The relationship in space
and time between these measurements is shown in Figure 2. Each of
the estimates is the average of four first differences taken over adjacent
measurements in the cube.

\[
\begin{aligned}
E_x \approx \tfrac{1}{4}\{ & E_{i,j+1,k} - E_{i,j,k} + E_{i+1,j+1,k} - E_{i+1,j,k} \\
 & + E_{i,j+1,k+1} - E_{i,j,k+1} + E_{i+1,j+1,k+1} - E_{i+1,j,k+1} \} \\
E_y \approx \tfrac{1}{4}\{ & E_{i+1,j,k} - E_{i,j,k} + E_{i+1,j+1,k} - E_{i,j+1,k} \\
 & + E_{i+1,j,k+1} - E_{i,j,k+1} + E_{i+1,j+1,k+1} - E_{i,j+1,k+1} \} \\
E_t \approx \tfrac{1}{4}\{ & E_{i,j,k+1} - E_{i,j,k} + E_{i+1,j,k+1} - E_{i+1,j,k} \\
 & + E_{i,j+1,k+1} - E_{i,j+1,k} + E_{i+1,j+1,k+1} - E_{i+1,j+1,k} \}
\end{aligned}
\]

Here the unit of length is the grid spacing interval in each image frame
and the unit of time is the image frame sampling period. We avoid
estimation formulae with larger support, since these typically are
equivalent to formulae of small support applied to smoothed images [16].
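
The three estimates can be sketched in a few lines of Python. This is an
illustration under stated assumptions rather than the authors' program: the
function name cube_derivatives, the list-of-rows frame layout and the linear
test pattern are all hypothetical. As in Figure 2, the column index j runs
along x and the row index i along y.

```python
def cube_derivatives(frame0, frame1, i, j):
    """Estimate (E_x, E_y, E_t) at the center of the 2x2x2 cube whose
    corner is row i, column j of frame0 (frame k) and frame1 (frame k+1).

    Frames are lists of rows; column index j runs along x, row index i along y.
    """
    a, b = frame0, frame1
    ex = 0.25 * ((a[i][j+1] - a[i][j]) + (a[i+1][j+1] - a[i+1][j])
                 + (b[i][j+1] - b[i][j]) + (b[i+1][j+1] - b[i+1][j]))
    ey = 0.25 * ((a[i+1][j] - a[i][j]) + (a[i+1][j+1] - a[i][j+1])
                 + (b[i+1][j] - b[i][j]) + (b[i+1][j+1] - b[i][j+1]))
    et = 0.25 * ((b[i][j] - a[i][j]) + (b[i+1][j] - a[i+1][j])
                 + (b[i][j+1] - a[i][j+1]) + (b[i+1][j+1] - a[i+1][j+1]))
    return ex, ey, et

# Hypothetical test pattern: brightness linear in x, y and t,
# E = 2x + 3y + 5t, so the three derivatives are 2, 3 and 5.
def brightness(x, y, t):
    return 2.0 * x + 3.0 * y + 5.0 * t

frame0 = [[brightness(j, i, 0) for j in range(3)] for i in range(3)]
frame1 = [[brightness(j, i, 1) for j in range(3)] for i in range(3)]
ex, ey, et = cube_derivatives(frame0, frame1, 0, 0)
```

Because all three estimates are averages over the same cube, they refer to
the same point in space and time, which is the consistency requirement
stated above.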


8. Estimating the Laplacian of the Flow Velocities

As will be shown in the next section, we will also need to approximate
the Laplacians of u and v. One convenient approximation
takes the following form

\[ \nabla^2 u \approx \kappa(\bar{u}_{i,j,k} - u_{i,j,k}) \quad \text{and} \quad \nabla^2 v \approx \kappa(\bar{v}_{i,j,k} - v_{i,j,k}), \]

where the local averages ū and v̄ are defined as follows

\[
\begin{aligned}
\bar{u}_{i,j,k} = {} & \tfrac{1}{6}\{u_{i-1,j,k} + u_{i,j+1,k} + u_{i+1,j,k} + u_{i,j-1,k}\} \\
 & + \tfrac{1}{12}\{u_{i-1,j-1,k} + u_{i-1,j+1,k} + u_{i+1,j+1,k} + u_{i+1,j-1,k}\} \\
\bar{v}_{i,j,k} = {} & \tfrac{1}{6}\{v_{i-1,j,k} + v_{i,j+1,k} + v_{i+1,j,k} + v_{i,j-1,k}\} \\
 & + \tfrac{1}{12}\{v_{i-1,j-1,k} + v_{i-1,j+1,k} + v_{i+1,j+1,k} + v_{i+1,j-1,k}\}.
\end{aligned}
\]

The proportionality factor κ equals 3 if the average is computed as
shown and we again assume that the unit of length equals the grid
spacing interval. Figure 3 illustrates the assignment of weights to
neighboring points. The approximation for the Laplacian using the center
cell and all eight neighbors is more stable than the usual one based on
the center cell and its four horizontal and vertical neighbors only.
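
The weighted average and its proportionality factor can be checked
numerically. The sketch below is an illustration only: the helper name
local_average and the quadratic test field are assumptions made here. For
u = x^2 + y^2, whose Laplacian is exactly 4, the difference between the
local average and the center value comes out as 4/3, which corresponds to a
proportionality factor of 3 for these weights.

```python
def local_average(f, i, j):
    """Weighted average of the eight neighbors of f[i][j]:
    weight 1/6 for the four edge neighbors, 1/12 for the four diagonal ones."""
    edge = f[i-1][j] + f[i][j+1] + f[i+1][j] + f[i][j-1]
    diag = f[i-1][j-1] + f[i-1][j+1] + f[i+1][j+1] + f[i+1][j-1]
    return edge / 6.0 + diag / 12.0

# Hypothetical test field u = x^2 + y^2, whose Laplacian is exactly 4.
# Grid index (i, j) maps to (y, x) = (i - 1, j - 1), so the center is the origin.
u = [[float((j - 1) ** 2 + (i - 1) ** 2) for j in range(3)] for i in range(3)]
diff = local_average(u, 1, 1) - u[1][1]   # average minus center value
kappa = 4.0 / diff                        # factor relating diff to the Laplacian
```

This helper handles interior points only; points on the image border need
the boundary rule described in the section on the iterative solution.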


9.  Minimization 

The problem then is to minimize the sum of the errors in the
equation for the rate of change of image brightness,

\[ \mathcal{E}_b = E_x u + E_y v + E_t, \]

and the measure of the departure from smoothness in the velocity flow,

\[ \mathcal{E}_c^2 = \left(\frac{\partial u}{\partial x}\right)^2 + \left(\frac{\partial u}{\partial y}\right)^2 + \left(\frac{\partial v}{\partial x}\right)^2 + \left(\frac{\partial v}{\partial y}\right)^2. \]

What should be the relative weight of these two factors? In practice
the image brightness measurements will be corrupted by quantization
error and noise so that we cannot expect ℰ_b to be identically zero. This
quantity will tend to have an error magnitude that is proportional to
the noise in the measurement. This fact guides us in choosing a suitable
weighting factor, denoted by α², as will be seen later.
Let the total error to be minimized be

\[ \mathcal{E}^2 = \iint \left( \alpha^2 \mathcal{E}_c^2 + \mathcal{E}_b^2 \right) dx\, dy. \]

The minimization is to be accomplished by finding suitable values for
the optical flow velocity (u, v). Using the calculus of variations [6, pp.
191-92], we obtain

\[
\begin{aligned}
E_x^2 u + E_x E_y v &= \alpha^2 \nabla^2 u - E_x E_t \\
E_x E_y u + E_y^2 v &= \alpha^2 \nabla^2 v - E_y E_t.
\end{aligned}
\]

Using the approximation to the Laplacian introduced in the previous
section,

\[
\begin{aligned}
(\alpha^2 + E_x^2) u + E_x E_y v &= (\alpha^2 \bar{u} - E_x E_t) \\
E_x E_y u + (\alpha^2 + E_y^2) v &= (\alpha^2 \bar{v} - E_y E_t).
\end{aligned}
\]

The determinant of the coefficient matrix equals α²(α² + E_x² + E_y²).
Solving for u and v we find that

\[
\begin{aligned}
(\alpha^2 + E_x^2 + E_y^2) u &= +(\alpha^2 + E_y^2)\bar{u} - E_x E_y \bar{v} - E_x E_t \\
(\alpha^2 + E_x^2 + E_y^2) v &= -E_x E_y \bar{u} + (\alpha^2 + E_x^2)\bar{v} - E_y E_t.
\end{aligned}
\]
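
As a quick numerical check of this solution, the following Python sketch
evaluates the closed form at one point and substitutes it back into the pair
of linear equations. The function name solve_point and the sample values are
hypothetical and chosen only to exercise the formulas.

```python
def solve_point(ex, ey, et, ubar, vbar, alpha2):
    """Closed-form solution of the 2x2 system at one image point."""
    denom = alpha2 + ex * ex + ey * ey
    u = ((alpha2 + ey * ey) * ubar - ex * ey * vbar - ex * et) / denom
    v = (-ex * ey * ubar + (alpha2 + ex * ex) * vbar - ey * et) / denom
    return u, v

# Hypothetical values, chosen only to exercise the formulas.
ex, ey, et, ubar, vbar, alpha2 = 1.5, -0.5, 0.25, 0.3, -0.2, 2.0
u, v = solve_point(ex, ey, et, ubar, vbar, alpha2)

# Residuals of the two linear equations the solution is meant to satisfy.
r1 = (alpha2 + ex * ex) * u + ex * ey * v - (alpha2 * ubar - ex * et)
r2 = ex * ey * u + (alpha2 + ey * ey) * v - (alpha2 * vbar - ey * et)
```

Both residuals vanish up to rounding error, which confirms that the factor
α² from the determinant cancels as the solved equations imply.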


10. Difference of Flow at a Point from Local Average

These equations can be written in the alternate form

\[
\begin{aligned}
(\alpha^2 + E_x^2 + E_y^2)(u - \bar{u}) &= -E_x [E_x \bar{u} + E_y \bar{v} + E_t] \\
(\alpha^2 + E_x^2 + E_y^2)(v - \bar{v}) &= -E_y [E_x \bar{u} + E_y \bar{v} + E_t].
\end{aligned}
\]

This shows that the value of the flow velocity (u, v) which minimizes
the error ℰ² lies in the direction towards the constraint line along a
line that intersects the constraint line at right angles. This relationship
is illustrated geometrically in Figure 4. The distance from the local
average is proportional to the error in the basic formula for rate of
change of brightness when ū, v̄ are substituted for u and v. Finally
we can see that α² plays a significant role only for areas where the
brightness gradient is small, preventing haphazard adjustments to the
estimated flow velocity occasioned by noise in the estimated derivatives.
This parameter should be roughly equal to the expected noise in
the estimate of E_x² + E_y².
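
The step from the solved equations of the previous section to this alternate
form is short. As a sketch, subtracting (\alpha^2 + E_x^2 + E_y^2)\bar{u}
from both sides of the solved equation for u gives the following; the
equation for v follows in the same way.

```latex
\begin{aligned}
(\alpha^2 + E_x^2 + E_y^2)(u - \bar{u})
 &= (\alpha^2 + E_y^2)\bar{u} - E_x E_y \bar{v} - E_x E_t
    - (\alpha^2 + E_x^2 + E_y^2)\bar{u} \\
 &= -E_x^2 \bar{u} - E_x E_y \bar{v} - E_x E_t \\
 &= -E_x\,[E_x \bar{u} + E_y \bar{v} + E_t].
\end{aligned}
```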


11. Constrained Minimization

When we allow α² to tend to zero we obtain the solution to a
constrained minimization problem. Applying the method of Lagrange
multipliers [35, 36] to the problem of minimizing ℰ_c² while maintaining
ℰ_b = 0 leads to

\[ E_y \nabla^2 u = E_x \nabla^2 v, \qquad E_x u + E_y v + E_t = 0. \]

Approximating the Laplacian by the difference of the velocity at a
point and the average of its neighbors then gives us

\[
\begin{aligned}
(E_x^2 + E_y^2)(u - \bar{u}) &= -E_x [E_x \bar{u} + E_y \bar{v} + E_t] \\
(E_x^2 + E_y^2)(v - \bar{v}) &= -E_y [E_x \bar{u} + E_y \bar{v} + E_t].
\end{aligned}
\]

Referring again to Figure 4, we note that the point computed here lies
at the intersection of the constraint line and the line at right angles
through the point (ū, v̄). We will not use these equations since we do
expect errors in the estimation of the partial derivatives.
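
Geometrically, the constrained solution is the foot of the perpendicular
dropped from the local average onto the constraint line. A minimal Python
sketch, with the hypothetical name project_onto_constraint and made-up
sample values:

```python
def project_onto_constraint(ex, ey, et, ubar, vbar):
    """Foot of the perpendicular from (ubar, vbar) onto the constraint line
    E_x u + E_y v + E_t = 0 (the alpha^2 -> 0 limit of the update)."""
    g2 = ex * ex + ey * ey
    if g2 == 0.0:
        # No gradient, hence no constraint line: keep the local average.
        return ubar, vbar
    residual = ex * ubar + ey * vbar + et
    return ubar - ex * residual / g2, vbar - ey * residual / g2

# Hypothetical values for illustration.
ex, ey, et = 2.0, 1.0, -3.0
u, v = project_onto_constraint(ex, ey, et, 0.5, 0.5)
on_line = ex * u + ey * v + et   # should vanish up to rounding
```

The result satisfies the brightness constraint exactly, which is why noise
in the derivative estimates passes straight into the flow when α² is zero.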


12.  Iterative  Solution 

We now have a pair of equations for each point in the image. It
would be very costly to solve these equations simultaneously by one
of the standard methods, such as Gauss-Jordan elimination [13, 14].
The corresponding matrix is sparse and very large since the number
of rows and columns equals twice the number of picture cells in the
image. Iterative methods, such as the Gauss-Seidel method [13, 15],
suggest themselves. We can compute a new set of velocity estimates
(u^{n+1}, v^{n+1}) from the estimated derivatives and the average of the
previous velocity estimates (u^n, v^n) by

\[
\begin{aligned}
u^{n+1} &= \bar{u}^n - E_x [E_x \bar{u}^n + E_y \bar{v}^n + E_t] / (\alpha^2 + E_x^2 + E_y^2) \\
v^{n+1} &= \bar{v}^n - E_y [E_x \bar{u}^n + E_y \bar{v}^n + E_t] / (\alpha^2 + E_x^2 + E_y^2).
\end{aligned}
\]

(It is interesting to note that the new estimates at a particular point do
not depend directly on the previous estimates at the same point.)

The natural boundary condition for the variational problem turns
out to be a zero normal derivative. At the edge of the image, some of
the points needed to compute the local average of velocity lie outside
the image. Here we simply copy velocities from adjacent points further
in.
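
The update and the boundary rule can be sketched together in plain Python.
This is a minimal illustration, not the authors' implementation: the names
local_average and iterate, the clamped-index reading of "copy velocities
from adjacent points further in", and the synthetic derivative fields are
all assumptions made here. The parameter alpha2 is assumed to be positive so
that the denominator never vanishes.

```python
import math

def local_average(f, i, j):
    """Weighted 8-neighbor average; indices outside the grid are clamped,
    which copies values from adjacent points further in."""
    rows, cols = len(f), len(f[0])
    def at(r, c):
        return f[min(max(r, 0), rows - 1)][min(max(c, 0), cols - 1)]
    edge = at(i-1, j) + at(i, j+1) + at(i+1, j) + at(i, j-1)
    diag = at(i-1, j-1) + at(i-1, j+1) + at(i+1, j+1) + at(i+1, j-1)
    return edge / 6.0 + diag / 12.0

def iterate(u, v, ex, ey, et, alpha2):
    """One pass of the iterative update over the whole grid."""
    rows, cols = len(u), len(u[0])
    new_u = [[0.0] * cols for _ in range(rows)]
    new_v = [[0.0] * cols for _ in range(rows)]
    for i in range(rows):
        for j in range(cols):
            ub, vb = local_average(u, i, j), local_average(v, i, j)
            gx, gy = ex[i][j], ey[i][j]
            step = (gx * ub + gy * vb + et[i][j]) / (alpha2 + gx * gx + gy * gy)
            new_u[i][j] = ub - gx * step
            new_v[i][j] = vb - gy * step
    return new_u, new_v

# Hypothetical synthetic derivatives consistent with a uniform flow (u0, v0).
n, u0, v0, alpha2 = 8, 0.5, 1.0, 1.0
ex = [[math.cos(0.9 * i + 0.4 * j) for j in range(n)] for i in range(n)]
ey = [[math.sin(0.5 * i - 0.7 * j) for j in range(n)] for i in range(n)]
et = [[-(ex[i][j] * u0 + ey[i][j] * v0) for j in range(n)] for i in range(n)]

# Start from zero flow and apply the update repeatedly.
u = [[0.0] * n for _ in range(n)]
v = [[0.0] * n for _ in range(n)]
for _ in range(100):
    u, v = iterate(u, v, ex, ey, et, alpha2)
```

The new estimates are written to fresh arrays, so each pass uses only the
previous estimates. A Gauss-Seidel variant would update in place instead.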


13.  Filling  In  Uniform  Regions 

In parts of the image where the brightness gradient is zero, the
velocity estimates will simply be averages of the neighboring velocity
estimates. There is no local information to constrain the apparent
velocity of motion of the brightness pattern in these areas. Eventually
the values around such a region will propagate inwards. If the velocities
on the border of the region are all equal to the same value, then points
in the region will be assigned that value too, after a sufficient number
of iterations. Velocity information is thus filled in from the boundary of
a region of constant brightness.

If the values on the border are not all the same, it is a little more
difficult to predict what will happen. In all cases, the values filled in
will correspond to the solution of the Laplace equation for the given
boundary condition [1, 26, 34].


The progress of this filling-in phenomenon is similar to the propagation
effects in the solution of the heat equation for a uniform flat plate,
where the time rate of change of temperature is proportional to the
Laplacian. This gives us a means of understanding the iterative method
in physical terms and of estimating the number of steps required. The
number of iterations should be larger than the number of picture cells
across the largest region that must be filled in. If the size of such
regions is not known in advance one may use the cross-section of the
whole image as a conservative estimate.
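
The filling-in behaviour is easy to reproduce on a small grid. The sketch
below is an illustration under assumptions made here (the function name
fill_in, a square 5 by 5 region, and a border held at the value 1.0): with
zero brightness gradient the update reduces to repeated local averaging.

```python
def fill_in(n, boundary_value, iterations):
    """Repeatedly replace each interior value of an (n+2) x (n+2) grid by the
    weighted average of its eight neighbors, keeping the border fixed.
    Models a region of n x n cells with zero brightness gradient."""
    size = n + 2
    f = [[boundary_value if i in (0, size - 1) or j in (0, size - 1) else 0.0
          for j in range(size)] for i in range(size)]
    for _ in range(iterations):
        g = [row[:] for row in f]
        for i in range(1, size - 1):
            for j in range(1, size - 1):
                edge = f[i-1][j] + f[i][j+1] + f[i+1][j] + f[i][j-1]
                diag = f[i-1][j-1] + f[i-1][j+1] + f[i+1][j+1] + f[i+1][j-1]
                g[i][j] = edge / 6.0 + diag / 12.0
        f = g
    return f

few = fill_in(5, 1.0, 2)[3][3]      # center value after 2 passes
many = fill_in(5, 1.0, 100)[3][3]   # center value after 100 passes
```

After two passes the center of the region is still untouched, because
information travels one cell per pass. After many passes it agrees with the
border value, as the text predicts for a border of equal velocities.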


14. Tightness of Constraint

When brightness in a region is a linear function of the image
coordinates we can only obtain the component of optical flow in the
direction of the gradient. The component at right angles is filled in from the
boundary of the region as described before. In general the solution is
most accurately determined in regions where the brightness gradient is
not too small and varies in direction from point to point. Information
which constrains both components of the optical flow velocity is then
available in a relatively small neighborhood. Too violent fluctuations
in brightness on the other hand are not desirable since the estimates
of the derivatives will be corrupted as the result of undersampling and
aliasing.


15. Choice of Iterative Scheme

As a practical matter one has a choice of how to interlace the
iterations with the time steps. On the one hand, one could iterate until the
solution has stabilized before advancing to the next image frame. On
the other hand, given a good initial guess one may need only one
iteration per time-step. A good initial guess for the optical flow velocities is
usually available from the previous time-step.

The advantages of the latter approach include an ability to deal
with more images per unit time and better estimates of optical flow
velocities in certain regions. Areas in which the brightness gradient
is small lead to uncertain, noisy estimates obtained partly by filling
in from the surround. These estimates are improved by considering
further images. The noise in measurements of the images will be
independent and tend to cancel out. Perhaps more importantly, different
parts of the pattern will drift by a given point in the image. The
direction of the brightness gradient will vary with time, providing
information about both components of the optical flow velocity.

A practical implementation would most likely employ one iteration
per time step for these reasons. We illustrate both approaches in
the experiments.


16. Experiments

The iterative scheme has been implemented and applied to image
sequences corresponding to a number of simple flow patterns. The
results shown here are for a relatively low resolution image of 32 by
32 picture cells. The brightness measurements were intentionally
corrupted by approximately 1% noise and then quantized into 256 levels
to simulate a real imaging situation. The underlying surface reflectance
pattern was a linear combination of spatially orthogonal sinusoids.
Their wavelength was chosen to give reasonably strong brightness
gradients without leading to undersampling problems. Discontinuities
were avoided to ensure that the required derivatives exist everywhere.
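
A test frame of this kind might be generated as follows. This is a sketch
under assumptions, not the authors' generator: the function name
synthetic_frame, the wavelength of 12 picture cells, the amplitudes, the
Gaussian noise model and the random seeds are all hypothetical. Only the
32 by 32 size, the roughly 1% noise level, the 256 levels and the use of
orthogonal sinusoids come from the text.

```python
import math
import random

def synthetic_frame(size=32, shift=(0.0, 0.0), noise=0.01, levels=256, seed=0):
    """Build one frame: a sum of two orthogonal sinusoids, shifted by
    `shift` = (dx, dy), corrupted by additive Gaussian noise of standard
    deviation `noise`, then quantized to `levels` integer levels."""
    rng = random.Random(seed)
    wavelength = 12.0   # hypothetical value, in picture cells
    dx, dy = shift
    frame = []
    for i in range(size):
        row = []
        for j in range(size):
            x, y = j - dx, i - dy
            # Reflectance in [0, 1]: one sinusoid along x, one along y.
            value = 0.5 + 0.25 * (math.sin(2 * math.pi * x / wavelength)
                                  + math.sin(2 * math.pi * y / wavelength))
            value += rng.gauss(0.0, noise)
            # Clip, then quantize to integer brightness levels.
            value = min(max(value, 0.0), 1.0)
            row.append(int(round(value * (levels - 1))))
        frame.append(row)
    return frame

frame0 = synthetic_frame(seed=1)
frame1 = synthetic_frame(shift=(0.5, 1.0), seed=2)   # pattern moved by (0.5, 1.0)
```

Using a different seed for each frame keeps the noise independent from frame
to frame, which matters for the one-iteration-per-time-step scheme.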

Shown in Figure 5, for example, are four frames out of a sequence
of images depicting a sphere rotating about an axis inclined towards
the viewer. A smoothly varying reflectance pattern is painted on the
surface of the sphere. The sphere is illuminated uniformly from all
directions so that there is no shading. We chose to work with synthetic
image sequences so that we can compare the results of the optical flow
computation with the exact values calculated using the transformation
equations relating image coordinates to coordinates on the underlying
surface reflectance pattern.


17.  Results 

The first flow to be investigated was a simple linear translation of
the entire brightness pattern. The resulting computed flow is shown
as a needle diagram in Figure 6 for 1, 4, 16, and 64 iterations. The
estimated flow velocities are depicted as short lines, showing the
apparent displacement during one time step. In this example a single time
step was taken so that the computations are based on just two images.
Initially the estimates of flow velocity are zero. Consequently the
first iteration shows vectors in the direction of the brightness gradient.
Later, the estimates approach the correct values in all parts of the
image. Few changes occur after 37 iterations when the velocity vectors
have errors of about 10%. The estimates tend to be too small, rather
than too large, perhaps because of a tendency to underestimate the
derivatives. The worst errors occur, as one might expect, where the
brightness gradient is small.

In the second experiment one iteration was used per time step on
the same linear translation image sequence. The resulting computed
flow is shown in Figure 7 for 1, 4, 16, and 64 time steps. The estimates
approach the correct values more rapidly and do not have a tendency
to be too small, as in the previous experiment. Few changes occur
after 16 iterations when the velocity vectors have errors of about 7%.
The worst errors occur, as one might expect, where the noise in recent
measurements of brightness was worst. While individual estimates of
velocity may not be very accurate, the average over the whole image
was within 1% of the correct value.

Next, the method was applied to simple rotation and simple
contraction of the brightness pattern. The results after 32 time steps are
shown in Figure 8. Note that the magnitude of the velocity is
proportional to the distance from the origin of the flow in both of these cases.
(By origin we mean the point in the image where the velocity is zero.)


In the examples so far the Laplacian of both flow velocity
components is zero everywhere. We also studied more difficult cases where
this was not the case. In particular, if we let the magnitude of the
velocity vary as the inverse of the distance from the origin we generate
flow around a line vortex and two dimensional flow into a sink. The
computed flow patterns are shown in Figure 9. In these examples, the
computation involved many iterations based on a single time step. The
worst errors occur near the singularity at the origin of the flow pattern,
where velocities are found which are much larger than one picture cell
per time step.

Finally we considered rigid body motions. Shown in Figure 10
are the flows computed for a cylinder rotating about its axis and for
a rotating sphere. In both cases the Laplacian of the flow is not zero
and in fact the Laplacian of one of the velocity components becomes
infinite on the occluding bound. Since the velocities themselves remain
finite, reasonable solutions are still obtained. The correct flow patterns
are shown in Figure 11. Comparing the computed and exact values
shows that the worst errors occur on the occluding boundary. These
boundaries constitute a one dimensional subset of the plane and so one
can expect that the relative number of points at which the estimated
flow is seriously in error will decrease as the resolution of the image is
made finer.

In Appendix B it is shown that there is a direct relationship
between the Laplacian of the flow velocity components and the Laplacian
of the surface height. This can be used to see how our smoothness
constraint will fare for different objects. For example, a rotating
polyhedron will give rise to a flow which has zero Laplacian except on
the image lines which are the projections of the edges of the body.


18.  Summary 

A method has been developed for computing optical flow from
a sequence of images. It is based on the observation that the flow
velocity has two components and that the basic equation for the rate of
change of image brightness provides only one constraint. Smoothness
of the flow was introduced as a second constraint. An iterative method
for solving the resulting equation was then developed. A simple
implementation provided visual confirmation of convergence of the
solution in the form of needle diagrams. Examples of several different types
of optical flow patterns were studied. These included cases where the
Laplacian of the flow was zero as well as cases where it became infinite
at singular points or along bounding curves.

The computed optical flow is somewhat inaccurate since it is based
on noisy, quantized measurements. Proposed methods for obtaining
information about the shapes of objects using derivatives (divergence
and curl) of the optical flow field may turn out to be impractical since
the inaccuracies will be amplified.


Appendix A. Rate of Change of Image Brightness

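Consider a patch of the brightness pattern that is displaced a
distance δx in the x-direction and δy in the y-direction in time δt. The
brightness of the patch is assumed to remain constant so that

\[ E(x, y, t) = E(x + \delta x, y + \delta y, t + \delta t). \]

Expanding the right hand side about the point (x, y, t) we get,

\[ E(x, y, t) = E(x, y, t) + \delta x \frac{\partial E}{\partial x} + \delta y \frac{\partial E}{\partial y} + \delta t \frac{\partial E}{\partial t} + \epsilon, \]

where ε contains second and higher order terms in δx, δy, and δt.
After subtracting E(x, y, t) from both sides and dividing through by δt
we have

\[ \frac{\delta x}{\delta t} \frac{\partial E}{\partial x} + \frac{\delta y}{\delta t} \frac{\partial E}{\partial y} + \frac{\partial E}{\partial t} + O(\delta t) = 0, \]

where O(δt) is a term of order δt (we assume that δx and δy vary as
δt). In the limit as δt → 0 this becomes

\[ \frac{\partial E}{\partial x} \frac{dx}{dt} + \frac{\partial E}{\partial y} \frac{dy}{dt} + \frac{\partial E}{\partial t} = 0. \]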


Appendix B. Smoothness of Flow for Rigid Body Motions

Let a rigid body rotate about an axis (ω_x, ω_y, ω_z), where the
magnitude of the vector equals the angular velocity of the motion. If
this axis passes through the origin, then the velocity of a point (x,
y, z) equals the cross product of (ω_x, ω_y, ω_z) and (x, y, z). There
is a direct relationship between the image coordinates and the x and
y coordinates here if we assume that the image is generated by
orthographic projection. The x and y components of the velocity can be
written,

\[
\begin{aligned}
u &= \omega_y z - \omega_z y \\
v &= \omega_z x - \omega_x z.
\end{aligned}
\]

Consequently,

\[
\begin{aligned}
\nabla^2 u &= +\omega_y \nabla^2 z \\
\nabla^2 v &= -\omega_x \nabla^2 z.
\end{aligned}
\]

This illustrates that the smoothness of the optical flow is related directly
to the smoothness of the rotating body and that the Laplacian of the
flow velocity will become infinite on the occluding bound, since the
partial derivatives of z with respect to x and y become infinite there.
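
As a worked illustration of the last claim, assume the body is a sphere of
radius R centered on the origin, with visible surface
z = \sqrt{R^2 - x^2 - y^2}. The radius R is introduced here for illustration
only. The Laplacian of the surface height then has a closed form:

```latex
\begin{aligned}
z &= \sqrt{R^2 - x^2 - y^2}, \qquad
\frac{\partial z}{\partial x} = -\frac{x}{z}, \qquad
\frac{\partial z}{\partial y} = -\frac{y}{z}, \\
\frac{\partial^2 z}{\partial x^2} &= -\frac{z^2 + x^2}{z^3}, \qquad
\frac{\partial^2 z}{\partial y^2} = -\frac{z^2 + y^2}{z^3}, \\
\nabla^2 z &= -\frac{2z^2 + x^2 + y^2}{z^3} = -\frac{R^2 + z^2}{z^3}.
\end{aligned}
```

On the occluding bound z tends to zero, so \nabla^2 z grows without limit,
and with it \nabla^2 u and \nabla^2 v whenever the corresponding component
of the rotation vector is nonzero.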


Figure  2.  The  llircc  p;nii.il  derivatives  of  image  brightness  ai  the  cen¬ 
ter  nf  the  etihe  arc  each  estimated  from  the  average  of  (irsi  differences 
along  four  |)«trallel  edges  of  the  etihe.  Here  the  column  aulcx  j  cor- 
responds  to  the  t  dircuion  in  the  image,  the  row  index  t  to  the  y 
direction,  while  k  lies  in  the  lime  direction. 
Figure 3. The Laplacian is estimated by subtracting the value at a
point from a weighted average of the values at neighboring points.
Shown here are suitable weights by which values can be multiplied.

Figure 4. The value of the flow velocity [...] lies on a line drawn
from the local average [...] perpendicular to the constraint line.

Figure 6. Flow pattern computed for simple translation of a brightness pattern. The estimates after 1, 4, 16, and 64
iterations are shown. The velocity is 0.5 picture cells in the x direction and 1.0 picture cells in the y direction per time
interval. Two images are used as input, depicting the situation at two times separated by one time interval.

Figure 7. Flow pattern computed for simple translation of a brightness pattern. The estimates after 1, 4, 16, and
64 time steps are shown. Here one iteration is used per time step. Convergence is more rapid and the velocities are
estimated more accurately.

Figure 8. Flow patterns computed for simple rotation and simple contraction of a brightness pattern. In the first
case, the pattern is rotated about 2.8 degrees per time step, while it is contracted about 5% per time step in the second
case. The estimates after [...] time steps are shown.

Figure 9. Flow patterns computed for flow around a line vortex and two-dimensional flow into a sink. In each case
the estimates after 32 iterations are shown.

Figure 10. Flow patterns computed for a cylinder rotating about its axis and for a rotating sphere. The axis of the
cylinder is inclined 30 degrees towards the viewer and that of the sphere 45 degrees. Both are rotating at about 5
degrees per time step. The estimates shown are obtained after 32 time steps.

Figure 11. Exact flow patterns for the cylinder and the sphere.
COMPUTATIONAL STEREO FROM AN IU PERSPECTIVE*


Stephen T. Barnard
Martin A. Fischler


SRI  International,  Menlo  Park,  California 


I  INTRODUCTION 


This paper surveys and evaluates computational
methods for the recovery of depth information from
multiple images. We identify the major functional
components that comprise these methods, list
various alternatives for implementing them, and
analyze the domain-dependent and application-dependent
constraints that favor some alternatives
over others. Finally, we outline a program for
evaluating the various components and systems on
the IU Testbed.

The  scope  of  this  paper  is  restricted  to 
research  in  the  image  understanding  community.  IU 
researchers  have  drawn  on  stereo  work  from  other 
areas,  especially  cartography,  psychology,  and 
neurophysiology.  We  will  not  try  to  cover  all  the 
IU research relevant to stereo, but instead will
select  a  cross-section  of  the  most  widely-known 
work  that  covers  all  the  important  and 
significantly  different  approaches  to  the  stereo 
problem. 

Much of the research in image understanding
has been devoted to recovering the range and
orientation of surfaces and objects depicted in
imaged data. The earliest work concentrated on an
artificial domain, the "blocks world."
Significant (but not necessarily extendable)
advances were made in this simple domain; in
particular, it was shown that edge and vertex
labeling schemes could provide constraints that
allowed one to correctly partition a complex scene.
More recent work, which has concentrated on
real-world problems, can be divided into three classes:
(1) those that acquire range information directly
with an active sensor, (2) those that depend on
monocular information available in a single image
(or perhaps several images from a single viewpoint
under different lighting), and (3) those that use
two or more images taken from different viewpoints
and perhaps at different times. We are concerned
here with this third class, which we shall refer to
as "generalized stereo."

The generalized stereo paradigm includes
conventional stereo, as well as what is often
called motion parallax. In conventional stereo two
images are recorded simultaneously, or nearly so,
by laterally displaced cameras. In motion parallax
two or more images are recorded sequentially,
usually with a single camera that moves in an
arbitrary direction. In a sense, conventional
stereo can be considered to be a special case of
motion parallax, and the same geometrical
formalisms apply to both. In practice, they are
often treated in different ways and are used in
different domains; for example, motion parallax
stereo forms the basis for most of the cartographic
products derived from aerial surveys while
conventional stereo is preferred for
three-dimensional biological imaging systems.

Stereo is an attractive source of information
for machine perception because it leads to direct
range measurements, and, unlike monocular
approaches, does not merely infer depth or
orientation through the use of photometric and
statistical assumptions. Once the stereo images
are brought into point-to-point correspondence, the
recovery of range values is a relatively
straightforward matter. Furthermore, stereo is a
passive method. Active ranging methods that use
structured light, laser rangefinders, or other
active sensing techniques are useful in tightly
controlled domains, such as industrial automation
applications, but are clearly unsuitable for most
machine vision problems.

Perhaps the most common use of computational
stereo is in the interpretation of aerial images.
Other applications are passive navigation for
autonomous vehicle guidance, industrial automation
applications, and the interpretation of
micro-stereophotographs for biomedical applications.
Each domain has different requirements that can
affect the design of a complete stereo system.


II  THE  COMPUTATIONAL  STEREO  PARADIGM 


Research on computational solutions for the
generalized stereo problem has followed a single
paradigm, although there have been several distinct
variations, both in method and intent. The
paradigm involves the following steps.


* The research described in this paper is based on work performed under Advanced Research Projects Agency
Contract No. MDA903-79-C-0588.

* Image acquisition

* Camera modeling

*  Feature  acquisition 

*  Matching 

*  Distance  (depth)  determination 

*  Interpolation 


A.  Image  Acquisition 

The most important factor affecting image
acquisition is the specific application for which
the stereo computation is intended. Three
applications have received the most attention: the
interpretation of aerial photographs for automated
cartography, guidance and obstacle avoidance for
autonomous vehicle control, and the modeling of
human stereo vision.

Aerial photo-interpretation involves low-oblique
and usually low-resolution images of a
variety of terrain types. The stereo images are
usually interpreted as pairs, rather than as longer
sequences involving more than two images.

Stereo for autonomous vehicle control has been
studied in two contexts: as a passive navigation
aid for drone aircraft, and as a control system for
surface vehicles. The images used for aircraft
navigation are similar to the aerial photographs
used for cartography, except that long sequences of
images are used, and multispectral sensors are
often employed. The images used for surface
vehicle control are quite different: they are
high-oblique, comparatively high-resolution images.

Research on computational models of human
stereo vision has mostly used synthetic random-dot
stereograms as its subject matter for
experimental investigation; the primary reason for
this is that random-dot stereograms exclude all
monocular depth cues, and the exact correspondences
are known. Because the parameters of a random-dot
stereogram, such as noise and density, can be
controlled, they allow systematic comparison of
human and machine performance. Grimson has also
reported experimental results with natural imagery
[reference number illegible].

Perhaps the most significant and widely
recognized difference in scene domains is the
difference between scenes containing cultural
features such as buildings and roads, and those
containing only natural objects and surfaces such
as mountains, flat or "rolling" terrain, foliage,
and water. Important stereo applications range
over both domains. Low-resolution aerial imagery,
for example, usually contains mostly natural
features, although cultural features are sometimes
found. Industrial applications, on the other hand,
tend to involve man-made objects exclusively.
Cultural features present special problems. For
example, periodic structures such as the windows of
buildings and road grids can confuse a stereo
matcher. The relative abundance of occlusion edges
in a city scene also causes problems because large
portions of the images may be unmatchable.
Cultural objects often have large surfaces with
nearly uniform albedo that are difficult to match
because of a lack of detail. Stereo systems that
have been reported in the literature are usually
targeted at a specific scene domain, and there is
seldom any attempt to validate the methods in other
domains.

In summary, the key parameters associated with
image acquisition are:

*  Scene  domain 

*  Timing 

-  Simultaneous 

-  Nearly  simultaneous 

-  Radically  different  times 

*  Time  of  day  (lighting  and  presence  of 
shadows) 

*  Photometry  (including  spectral  windows) 

*  Resolution 

* Field of view

*  Relative  camera  positioning  (length  and 
orientation,  relative  to  the  scene,  of  the 
stereo  base  line). 

The issues associated with the scene domain
are the percentages of:

*  Occlusion 

*  Man-made  objects  (straight  edges,  flat 
surfaces) 

*  Continuous  surfaces  of  some  minimal  extent 

*  Textureless  area 

*  Area  containing  repetitive  structure. 

B.  Camera  Modeling 

Perspective geometry can be used to constrain
the search for matches to one dimension. The
extended line connecting the perspective centers of
two cameras is called the air base; the points
where the air base intersects the image planes are
the epipoles; and a plane that contains the
epipoles is an epipolar plane. Every point in one
image of a stereo pair defines an epipolar plane,
and the corresponding point in the other image must
lie in the same plane. The search for a match of a
point in the first image may therefore be limited
to the line in the second image that is the
intersection of the epipolar plane with the image
plane, commonly called an epipolar line.
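
The epipolar constraint described above can be
illustrated with a short numerical sketch. It is not
taken from any of the systems surveyed here. It
assumes two ideal pinhole cameras with unit focal
length, with a scene point X1 in the first camera's
coordinates appearing at X2 = R*X1 + t in the second;
the function names are hypothetical.

```python
# Sketch only: ideal pinhole cameras with unit focal length (an assumption).
# A scene point X1 in camera-1 coordinates is X2 = R*X1 + t in camera 2.

def cross_matrix(t):
    """Skew-symmetric matrix [t]x such that [t]x * v equals t cross v."""
    tx, ty, tz = t
    return [[0.0, -tz, ty], [tz, 0.0, -tx], [-ty, tx, 0.0]]

def matmul(a, b):
    return [[sum(a[i][k] * b[k][j] for k in range(3)) for j in range(3)]
            for i in range(3)]

def matvec(a, v):
    return [sum(a[i][k] * v[k] for k in range(3)) for i in range(3)]

def essential_matrix(R, t):
    """E = [t]x R; matching image points satisfy x2^T E x1 = 0."""
    return matmul(cross_matrix(t), R)

def epipolar_line(E, x1):
    """Coefficients (a, b, c) of the line a*u + b*v + c = 0 in image 2
    on which the match of the image-1 point x1 = (u1, v1) must lie."""
    return matvec(E, [x1[0], x1[1], 1.0])

# "Perfect" pair: no rotation, purely lateral displacement of 0.5 units.
identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
a, b, c = epipolar_line(essential_matrix(identity, [-0.5, 0.0, 0.0]), (0.2, 0.3))
# Here a is 0 and -c/b is 0.3: the epipolar line is the scan line v = 0.3.
```

With a nonzero rotation the same call returns a line
that is generally not a scan line, which is the extra
bookkeeping an imperfect stereo pair requires.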

If the stereo pair is "perfect," the epipolar
lines are coincident with the horizontal scan lines,
a convenient situation because the matching
process can be made relatively simple and
efficient. Stereo systems that have been primarily
concerned with modeling human ability have used
this approach [1,2]. In practical applications,
however, the stereo pair may be imperfect. In
aerial stereo photogrammetry, for example, the
camera axis may typically be tilted as much as two
to three degrees from vertical [3]. The
implication here is that points on a scan line in
one image will not fall on a single scan line in
the second image of the stereo pair, and thus the
computational cost to employ the epipolar
constraint is significantly increased.

The relative position of the two
cameras is called the camera model. Camera
models are important because they allow one to
exploit the epipolar constraint. In most cases,
considerable a priori knowledge of the camera model
is available, but it is often not as accurate as
desired. Gennery [4] has developed a method for
solving for the camera model from relatively few
sparse matches. His method accounts for
differences in azimuth, elevation, pan, tilt, roll,
and focal length.

Fischler and Bolles [5] have provided a number
of results with respect to the minimum number of
points needed to obtain a solution to the camera
calibration problem, given a single image and a set
of correspondences between points in the image and
their spatial (geographic) locations; they also
provide a technique for solving for the complete
camera model, even when the given correspondences
contain a large percentage of errors. While this
work was directed at the problem of establishing a
mapping between an image and an existing geographic
database, it is obviously possible to apply the
results to the stereo problem, and in fact, tying
the stereo pair to an existing database offers the
possibility of employing constraints beyond those
available from the imaging geometry.
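
One general way to fit a model when many of the given
correspondences are gross errors is to fit repeatedly
to small random samples and keep the fit that the most
data agree with. The sketch below shows that idea on a
toy problem (fitting a 2D line) rather than on the
camera model itself. It is an illustration under
stated assumptions, not the procedure of [5]; the
names, the tolerance, and the trial count are all
hypothetical.

```python
import random

def fit_line(p, q):
    """Slope and intercept of the line through two points (None if vertical)."""
    (x1, y1), (x2, y2) = p, q
    if x1 == x2:
        return None
    slope = (y2 - y1) / (x2 - x1)
    return slope, y1 - slope * x1

def consensus_fit(points, tolerance=0.5, trials=200, seed=0):
    """Fit a model to many minimal random samples and keep the model
    that the largest number of points agree with (within `tolerance`)."""
    rng = random.Random(seed)
    best_model, best_support = None, -1
    for _ in range(trials):
        model = fit_line(*rng.sample(points, 2))
        if model is None:
            continue
        slope, intercept = model
        support = sum(1 for x, y in points
                      if abs(y - (slope * x + intercept)) <= tolerance)
        if support > best_support:
            best_model, best_support = model, support
    return best_model, best_support

# Ten points on y = 2x + 1 plus four gross errors.
good = [(float(x), 2.0 * x + 1.0) for x in range(10)]
bad = [(1.0, 20.0), (3.0, -7.0), (6.0, 30.0), (8.0, -2.0)]
model, support = consensus_fit(good + bad)
```

A least-squares fit to all fourteen points would be
pulled away from the true line by the four errors; the
sample-and-vote loop ignores them as long as some
sample happens to contain only good points.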

Camera modeling can be extended to include
distortions introduced in the image-making process.
Significant image distortion will degrade the
accuracy of depth measurements made by a stereo
system unless corrected. Two kinds of image
distortion are found: radial and tangential.

Radial distortion causes image points to be
displaced perpendicular to the optical axis and may
occur in the form of pin-cushion distortion (i.e.,
positive radial distortion) or barrel distortion
(i.e., negative radial distortion). Tangential
distortion is caused by imperfect centering of lens
elements, resulting in image displacements
perpendicular to the radial lines. Moravec
described a method to correct for distortion using
a square pattern of dots [6]. Fourth-degree
polynomials are found that transform the measured
positions of the dots to their nominal positions.
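
As a rough illustration of radial distortion and its
correction, the sketch below uses a single-coefficient
radial model about the image center. This is a
simplification assumed for the example; it is not the
fourth-degree polynomial correction of [6], and the
coefficient k and the function names are hypothetical.

```python
def distort(point, k):
    """Apply one-coefficient radial distortion about the image center.
    k > 0 pushes points outward (pin-cushion); k < 0 pulls them inward (barrel)."""
    x, y = point
    scale = 1.0 + k * (x * x + y * y)
    return (x * scale, y * scale)

def undistort(point, k, iterations=20):
    """Invert `distort` by fixed-point iteration (adequate for mild distortion)."""
    xd, yd = point
    x, y = xd, yd                      # initial guess: the distorted position
    for _ in range(iterations):
        scale = 1.0 + k * (x * x + y * y)
        x, y = xd / scale, yd / scale
    return (x, y)

measured = distort((0.3, 0.4), k=0.1)   # pin-cushion: moves away from the center
restored = undistort(measured, k=0.1)   # close to the nominal position (0.3, 0.4)
```

In a calibration setting k would be estimated from a
known target, such as the dot pattern mentioned above,
by comparing measured and nominal positions.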

In  summary,  the  important  issues  in  camera 
modeling  are: 

*  A  priori  knowledge  of  camera  positions 

*  Solutions  using  a  few  sparse  matches 

*  Use  of  known  scene  coordinates 


* Ability to deal with matching errors

*  Compensation  for  image  distortion 


C.  Feature  Acquisition 

That featureless areas of nearly homogeneous
brightness cannot be matched with confidence is
widely recognized. Accordingly, most work in
computational stereo has included some form of
local feature detection, the particular form of
which is closely coupled with the matching strategy
used.

Approaches that apply area cross-correlation
matching often use an interest operator to locate
places in one image that can be matched with
confidence. One way to do this is to select areas
that have high variance. These areas will not be
good features, however, if the variance is due only
to brightness differences in the direction
perpendicular to the epipolar line. These areas
can be culled by demanding that the two-dimensional
autocorrelation function have a distinct peak.
Another widely used interest operator is the
Moravec operator [6], which selects points that
have high variance between adjacent pixels in four
directions. Hannah has modified this operator to
consider ratios of directional variances as well as
ordinary isotropic variance over larger areas, and
this modified operator seems to locate a better
selection of both strong and subtle features.
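
A minimal sketch of an interest measure in the spirit
of the directional-variance idea described above:
sums of squared differences between adjacent pixels
are formed in four directions over a small window, and
the smallest sum is taken as the interest value. The
window size and the absence of any normalization or
local-maximum selection are assumptions of this
example, not details taken from [6].

```python
def moravec_interest(img, r, c, half=2):
    """Minimum, over four directions, of the sum of squared differences
    between adjacent pixels in a (2*half+1)-square window centered at (r, c).
    The caller must keep (r, c) at least half+1 pixels from the border."""
    directions = [(0, 1), (1, 0), (1, 1), (1, -1)]
    sums = []
    for dr, dc in directions:
        total = 0
        for i in range(r - half, r + half + 1):
            for j in range(c - half, c + half + 1):
                total += (img[i][j] - img[i + dr][j + dc]) ** 2
        sums.append(total)
    return min(sums)

# A straight vertical edge varies in only one direction, so it scores 0;
# a corner varies in every direction and scores above 0.
edge = [[0 if j < 4 else 10 for j in range(9)] for i in range(9)]
corner = [[10 if (i >= 4 and j >= 4) else 0 for j in range(9)] for i in range(9)]
edge_score = moravec_interest(edge, 4, 4)
corner_score = moravec_interest(corner, 4, 4)
```

Taking the minimum is what rejects straight edges:
along the edge direction the differences vanish, so
such a window cannot be localized along that direction.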

Feature detection is more centrally important
to those approaches that directly match features in
the stereo images. The features that are used may
vary in size, direction, and dimensionality.

Point-like features are good candidates for
matching when the camera model is unknown and the
matches are not constrained to epipolar lines.
This is because, unlike linear features, points are
unambiguously located in the image and can be
matched in any direction. Linear features must be
oriented across the epipolar lines if they are to
be matched accurately. Another advantage of
point-like features is that they can be matched without
concern for perspective distortion. In
area-correlation approaches point-like features are
often used to obtain the camera model prior to more
extensive matching. The local intensity values
around a point can be used to establish initial
confidences of matches in a way similar to area
correlation [7].

If the camera model is known a priori or
derived in a preliminary step, edge elements can be
used as primitive matching features. The
computational model of human stereo vision
described by Marr and Poggio uses zero-crossings in
the convolution of a circularly symmetric Laplacian
with a Gaussian low-pass filtered image [1], [2].
The zero-crossings are found in the convolutions
with four differently sized masks. Arnold uses the
Hueckel operator to find linear features, but the
operator has a fixed size [9]. Baker uses
zero-crossings of a one-dimensional operator, again of a
fixed size [10].

Many distinct edge models have been proposed
as the basis for edge-detecting algorithms. In the
case of "strong" edges, most of the resulting
algorithms yield similar results for operators of
comparable sizes. Often the same underlying model
appears in different implementations; e.g.,
zero-crossings in the second derivative are equivalent
to local maxima in the first derivative, and most
of the conventional edge detection methods search
for approximations to first derivative maxima.
More important is how the edge attributes can be
used for matching; size, direction, and magnitude
(contrast) have been used, but their relative
merit is not established.
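
The equivalence between second-derivative
zero-crossings and first-derivative maxima can be
checked on a one-dimensional intensity profile. The
sketch below uses plain finite differences on an
already blurred step edge; it is a simplification
assumed for illustration and does not implement the
Laplacian-of-Gaussian masks of [1], [2]. The sample
profile and function names are hypothetical.

```python
def differences(values):
    """Discrete derivative: differences between adjacent samples."""
    return [b - a for a, b in zip(values, values[1:])]

def local_maxima(values):
    """Indices that are strictly larger than both neighbors."""
    return [i for i in range(1, len(values) - 1)
            if values[i - 1] < values[i] > values[i + 1]]

def zero_crossings(values):
    """Indices i where the sign changes between values[i] and values[i + 1]."""
    return [i for i in range(len(values) - 1) if values[i] * values[i + 1] < 0]

profile = [0, 0, 0, 1, 4, 9, 12, 13, 13, 13]    # blurred step edge
first = differences(profile)                     # [0, 0, 1, 3, 5, 3, 1, 0, 0]
second = differences(first)                      # [0, 1, 2, 2, -2, -2, -1, 0]
peaks = local_maxima(first)                      # [4]
crossings = [i + 1 for i in zero_crossings(second)]   # [4], the same location
```

The strict sign test misses crossings that pass
through an exact zero sample, so a practical detector
would need to handle flat spots in the second
derivative as well.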


For the most part, low level features have
been used for stereo. What we mean by "low level"
is that the features depend only on local monocular
intensity patterns, and are based on the assumption
that more-or-less sharp intensity gradients are
usually due to physically significant structural,
reflectance, and illumination events in the scene.
Higher level features that depend on more
sophisticated semantic analysis have been largely
unused (Ganapathy described a system for matching
vertices in blocks-world stereo scenes across very
large viewing angles [11]). The ability to
classify edges as occlusion or nonocclusion
boundaries, for example, could be very useful to a
stereo system, especially in the difficult domains
that include a wealth of cultural features.

In summary, the properties of local features
that are important to the computational stereo
problem are:

* Dimensionality (point-like versus edge-like)

* Size (spatial frequency)

* Magnitude (contrast)

* Semantic content

* Density of occurrence

* Easily measurable attributes

* Uniqueness/distinguishability


D. Matching

Image matching is a core area in scene
analysis and will not be covered in full detail
in this paper. Instead, we will focus on those
portions of the image-matching problem that are
directly relevant to stereo modeling. Features
that distinguish stereo image matching from image
matching in general are the following:

* Images are taken at approximately the same
time and from the same viewpoint in space.
Thus, illumination/shadow conditions are
the same (although there can be significant
differences in specular reflection). Most
of the significant changes will occur in
the appearance of nearby objects and in
occlusions. Additional changes in both
geometry and photometry can be introduced
in the film development and scanning steps,
but can usually be avoided by careful
processing.

*  Stereo  modeling  generally  requires  that  a 
dense  grid  of  points  be  matched. 

Ideally,  we  would  like  to  find  the 
correspondences  (i.e.,  the  matched  locations)  of 
every  individual  pixel  in  both  images  of  a  stereo 
pair.  However,  it  is  obvious  that  the  information 
content  in  the  intensity  value  of  a  single  pixel  is 
too  low  for  unambiguous  matching.  In  practice, 
coherent  collections  of  pixels  are  matched.  These 
collections  are  determined  and  matched  in  two 
distinct  ways  (see  the  discussion  in  the  preceding 
section  on  feature  acquisition): 

* Area Matching: Regularly sized
neighborhoods of a pixel are the basic
units that are matched. This approach is
justified by the "continuity assumption,"
which asserts that at the level of
resolution at which stereo matching is
feasible, most of the image depicts
portions of continuous surfaces; therefore,
adjacent pixels in an image will generally
represent contiguous points in space. This
approach is almost invariably accompanied
by correlation matching to establish the
correspondences.

* Feature Matching: "Semantic features" (with
known physical properties and/or spatial
geometry), or "intensity anomaly features"
(isolated anomalous intensity patterns not
necessarily having any physical
significance), are the basic units that are
matched. Semantic features of the generic
type include occlusion edges, vertices of
linear structures, and prominent surface
markings; domain-specific semantic features
might include such features as the corner
or peak of a building, or a road surface
marking; intensity anomaly features include
those such as zero crossings and image
patches found by the Moravec interest
operator. Methods used for feature
matching often include symbolic
classification techniques, as well as
correlation.
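
The area-matching approach in the first item above can
be sketched for a single scan line. The example
assumes a "perfect" stereo pair, so the search runs
along the same scan line, and it uses normalized
cross-correlation as the similarity score. The window
size, search range, sample data, and function names
are assumptions made for illustration.

```python
import math

def ncc(a, b):
    """Normalized cross-correlation of two equal-length windows, in [-1, 1].
    Returns 0.0 when either window has no variation (nothing to match on)."""
    mean_a, mean_b = sum(a) / len(a), sum(b) / len(b)
    da = [v - mean_a for v in a]
    db = [v - mean_b for v in b]
    denom = math.sqrt(sum(v * v for v in da) * sum(v * v for v in db))
    if denom == 0.0:
        return 0.0
    return sum(p * q for p, q in zip(da, db)) / denom

def best_disparity(left, right, x, half, max_disp):
    """Search one scan line of `right` for the window that best matches the
    window of `left` centered at x; a disparity d pairs x with x - d."""
    window = left[x - half:x + half + 1]
    scores = []
    for d in range(max_disp + 1):
        start = x - d - half
        if start < 0:
            break
        scores.append((ncc(window, right[start:start + 2 * half + 1]), d))
    return max(scores)[1]

left_row = [1, 3, 2, 8, 5, 9, 4, 7, 6, 2, 9, 1, 5, 3, 8, 4, 6, 2, 7, 5]
right_row = left_row[3:] + [0, 0, 0]          # same row shifted by 3 pixels
d = best_disparity(left_row, right_row, x=10, half=2, max_disp=5)   # 3
```

The zero score for a constant window reflects the
observation made earlier that featureless areas cannot
be matched with confidence.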

Obviously, feature matching alone cannot
provide the desired dense depth map, so it must be
augmented by a model-based interpretation step
(e.g., we recognize the edges of buildings and
assume that the intermediate space is occupied by
planar walls and roofs), or by area matching. When
used in conjunction with area matching, the feature
matches are generally considered to be more
reliable and can constrain the search for
correlation matches.

To further reduce the possibility of error
caused by an ambiguous match, a number of
hierarchical and global matching techniques have
been employed, including relaxation matching and
various "coarse-fine" hierarchical matching
strategies.

The correlation-matching approach attempts to
resolve ambiguity by using as much local
information as possible to make decisions about
potential matches, but each match decision is made
independently of the others. The
relaxation-labeling approach uses a relatively small amount of
local information for each potential match, and
attempts to resolve ambiguity by finding
consensuses among subsets of the total population
of matches, relying on the three-dimensional
continuity of surfaces to be reflected in the
two-dimensional continuity of disparity. A method for
avoiding ambiguity that can be applied to both
correlation matching [6] and feature point matching
[2] is the so-called "coarse-fine" strategy. In
this approach coarse disparities are found
relatively quickly, but with low accuracy. These
gross disparities are used to bias finer-resolution
matching. Even with a coarse-fine strategy,
however, some ambiguity at each level of resolution
is inevitable. The best combination of ambiguity
avoidance and ambiguity resolution is a major
research issue.
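
A two-level sketch of the coarse-fine strategy on a
single scan line follows. The coarse level is made by
averaging pixel pairs, a full search is run there, and
the fine level then examines only the disparities
adjacent to twice the coarse result. The
sum-of-squared-differences score, the two-level
pyramid, and all names and data are assumptions of
this example rather than details of [2] or [6].

```python
def ssd(a, b):
    """Sum of squared differences between two windows (lower is better)."""
    return sum((p - q) ** 2 for p, q in zip(a, b))

def search(left, right, x, half, candidates):
    """Best disparity among `candidates` for the left window centered at x."""
    window = left[x - half:x + half + 1]
    best = None
    for d in candidates:
        start = x - d - half
        if start < 0 or x - d + half >= len(right):
            continue
        cost = ssd(window, right[start:start + 2 * half + 1])
        if best is None or cost < best[0]:
            best = (cost, d)
    return best[1]

def downsample(row):
    """Halve the resolution by averaging neighboring pixel pairs."""
    return [(row[i] + row[i + 1]) / 2 for i in range(0, len(row) - 1, 2)]

def coarse_fine_disparity(left, right, x, half, max_disp):
    """Full search at half resolution, then a narrow search at full resolution."""
    coarse = search(downsample(left), downsample(right), x // 2, half,
                    range(0, max_disp // 2 + 1))
    return search(left, right, x, half,
                  range(max(0, 2 * coarse - 1), 2 * coarse + 2))

left_row = [1, 3, 2, 8, 5, 9, 4, 7, 6, 2, 9, 1,
            5, 3, 8, 4, 6, 2, 7, 5, 9, 3, 1, 6]
right_row = left_row[4:] + [0, 0, 0, 0]       # same row shifted by 4 pixels
d = coarse_fine_disparity(left_row, right_row, x=12, half=1, max_disp=6)   # 4
```

The fine search sees only three candidates, so a wrong
coarse estimate cannot be recovered at the fine level;
this is one form of the residual ambiguity noted above.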

In  summary,  key  attributes  which  differentiate 
matching  techniques  include: 

*  Local  versus  global  ambiguity  resolution 

*  Area  (dense)  versus  feature  (sparse) 
matching. 

The constraints used to both limit computation
and reduce ambiguity include:

*  Epipolar 

*  Continuity 

*  Hierarchical  (e.g.,  coarse-fine  matching) 

*  Sequential  (e.g.,  feature  tracking  in 
sequential  views). 

Criteria  that  can  be  used  to  evaluate  (or 
compare)  different  matching  techniques  include: 

*  Accuracy  (match  precision  to  the  sub-pixel 
level) 

*  Reliability  (resistance  to  gross 
classification  errors) 

*  Generality  (applicability  to  different 
scene  domains) 

*  Predictability  (availability  of  performance 
models) 

*  Complexity  (cost  of  implementation; 
computational  requirements). 


E.  Distance  Determination 

With few exceptions, work in image
understanding has not dealt with the specific
problem of distance determination. The matching
problem has been considered the hardest and most
significant problem in computational stereo. Once
accurate matches have been found the determination
of distance is a relatively simple matter of
triangulation; nevertheless, this step presents
significant difficulties, especially if the matches
are somewhat inaccurate or unreliable.

To a first approximation, the error in
stereo distance measurements is directly
proportional to the error in the matches and
inversely proportional to the length of the stereo
baseline. We have discussed how lengthening the
stereo baseline complicates the matching problem by
increasing ambiguity, and how various matching
strategies have been used to overcome this problem
(coarse/fine strategies, cooperative or
relaxation-labeling approaches, and incremental stereo views).
The role that accuracy of matches plays has been
less thoroughly examined.
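
For the special case of two parallel cameras, the
triangulation step and the first-order dependence of
distance error on matching error and baseline length
can be written down directly. The sketch assumes a
"perfect" stereo pair with focal length f in pixels,
baseline B, and disparity d in pixels; these symbols
and the function names are introduced here for
illustration only.

```python
def depth_from_disparity(focal_px, baseline, disparity_px):
    """Distance to a matched point for parallel cameras: Z = f * B / d."""
    return focal_px * baseline / disparity_px

def depth_error(focal_px, baseline, depth, match_error_px):
    """First-order distance error caused by a matching error of
    `match_error_px` pixels: dZ = Z**2 / (f * B) * dd."""
    return depth * depth / (focal_px * baseline) * match_error_px

z = depth_from_disparity(1000.0, 0.5, 10.0)   # 50.0, in the units of the baseline
e1 = depth_error(1000.0, 0.5, z, 1.0)         # 5.0 for a one-pixel matching error
e2 = depth_error(1000.0, 1.0, z, 1.0)         # 2.5: doubling the baseline halves it
```

The error expression follows from differentiating
Z = f*B/d with respect to d. It also shows that, for a
fixed matching error, the distance error grows with
the square of the distance.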

In many cases, matches are made to an accuracy
of only a pixel. However, both the area
correlation and the feature-matching approaches can
lead to better accuracy. Sub-pixel accuracy using
area correlation requires expensive interpolation
over correlation patches, however, and also
complicates feature-matching approaches.

Another approach is to settle for one-pixel
accuracy, but to use multiple views [6]. A match
from a particular pair of views represents a depth
estimate with uncertainty that depends on the
one-pixel accuracy of the match and on the length of
the stereo baseline. Matches from many pairs of
views can be statistically averaged to find a more
accurate estimate. The contribution of a match to
the final depth estimate can be weighted according
to any factors that bear on the confidence of the
match and on its accuracy.
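
One simple weighting scheme for combining depth
estimates from several pairs of views is the
inverse-variance weighted mean. The sketch assumes
that each estimate comes with a standard deviation and
that the errors are independent; both are assumptions
of this example, and [6] may weight matches
differently.

```python
import math

def fuse_depths(estimates):
    """Inverse-variance weighted mean of (depth, sigma) estimates.
    Returns the fused depth and its standard deviation."""
    weights = [1.0 / (sigma * sigma) for _, sigma in estimates]
    total = sum(weights)
    depth = sum(w * z for w, (z, _) in zip(weights, estimates)) / total
    return depth, math.sqrt(1.0 / total)

# Two equally reliable views: plain average, with reduced uncertainty.
depth, sigma = fuse_depths([(10.0, 1.0), (12.0, 1.0)])
# A short-baseline view (large sigma) contributes little to the result.
depth2, sigma2 = fuse_depths([(10.0, 1.0), (20.0, 3.0)])
```

Under these assumptions the fused standard deviation
is never larger than that of the best single estimate,
which is the sense in which averaging over views
improves accuracy.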

In summary, better depth measurements can be
obtained in several ways, each involving overhead:

*  Sub-pixel  matching 

*  Increased  stereo  baseline 

*  Statistical  averaging  over  several  views. 


F.  Interpolation 

As previously mentioned, stereo applications
usually demand a dense array of depth estimates
that the feature matching approach cannot provide
because features are sparsely and irregularly
distributed over the images. The area
correlation-matching approach is more suited to obtaining dense
matches, although it tends to be unreliable in
areas of low information. Consequently, some kind
of interpolation step is required.

The most straightforward way to create the
dense depth array from a sparse array is simply to
treat the sparse array as a sampling of a
continuous depth function, and to approximate the
continuous function using a conventional
interpolation method (for example, by fitting
splines). Assuming the sparse depth array is
complete enough to capture the important changes in
depth, satisfying the conditions of the sampling
theorem, this approach may be adequate. Aerial
stereophotographs of rolling terrain, for example,
might be handled in this way. In many
applications, however, the continuous depth
function model will not be appropriate because of
occlusion edges.

A more sophisticated approach to the
interpolation problem is to fit a priori geometric
models to the sparse depth array. Normally, model
fitting would be preceded by clustering to find the
subsets of points that correspond to significant
structures. Each cluster would then be fit to the
best available model, thereby instantiating the
model's free variables and providing an
interpolation function. This approach has been
used to find ground planes [9], elliptical
structures in stereophotographs [12], and smooth
surfaces in range data acquired with a laser
rangefinder [13].
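
As a small example of fitting a geometric model to a
cluster of sparse depth values, the sketch below fits
a plane z = a*x + b*y + c by least squares and uses it
as an interpolation function. The clustering step is
assumed to have been done already, and the method
(normal equations solved by Cramer's rule) is chosen
for brevity; it is not claimed to be the procedure of
[9], [12], or [13].

```python
def det3(m):
    """Determinant of a 3x3 matrix given as nested lists."""
    return (m[0][0] * (m[1][1] * m[2][2] - m[1][2] * m[2][1])
            - m[0][1] * (m[1][0] * m[2][2] - m[1][2] * m[2][0])
            + m[0][2] * (m[1][0] * m[2][1] - m[1][1] * m[2][0]))

def fit_plane(points):
    """Least-squares fit of z = a*x + b*y + c to (x, y, z) samples."""
    n = float(len(points))
    sx = sum(x for x, y, z in points)
    sy = sum(y for x, y, z in points)
    sz = sum(z for x, y, z in points)
    sxx = sum(x * x for x, y, z in points)
    syy = sum(y * y for x, y, z in points)
    sxy = sum(x * y for x, y, z in points)
    sxz = sum(x * z for x, y, z in points)
    syz = sum(y * z for x, y, z in points)
    normal = [[sxx, sxy, sx], [sxy, syy, sy], [sx, sy, n]]
    rhs = [sxz, syz, sz]
    d = det3(normal)
    if abs(d) < 1e-12:
        raise ValueError("points do not determine a unique plane")
    coeffs = []
    for k in range(3):                 # Cramer's rule, one unknown at a time
        replaced = [row[:] for row in normal]
        for i in range(3):
            replaced[i][k] = rhs[i]
        coeffs.append(det3(replaced) / d)
    return tuple(coeffs)

def plane_depth(coeffs, x, y):
    """Interpolated depth at image position (x, y) from the fitted plane."""
    a, b, c = coeffs
    return a * x + b * y + c
```

Once the coefficients are known, `plane_depth` can
fill in depth at every pixel assigned to the cluster,
including textureless areas where no match was found.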


III  EVALUATION


The following criteria are appropriate for
evaluating both complete stereo systems and the
components of such systems. More specialized
criteria relevant to individual components of
stereo systems were presented in previous sections
of this paper.

(1) Disparity - what range of disparity is
handled? One possible advantage of
automated stereo analysis is that
computer methods may be able to handle
larger angular disparities than humans
can. Larger disparities lead to more
accurate depth measurements, but also to
more difficult matching.

(2) Coverage - what percentage of the scene
is matched? Also, how widely are the
matches distributed? Clearly, large,
featureless, homogeneous areas cannot be
readily matched. What kinds of
interpolation techniques can be used to
extend disparity across such areas? What
monocular techniques can be used to
enhance coverage (for example,
photometric evidence for smooth
surfaces)?

(3) Accuracy

(4) Reliability - how many false matches are
made compared to valid matches? What
methods are effective for detecting and
eliminating false matches?

(5) Domain sensitivity - what range of scene
domains can be handled?

(6) Efficiency - actual timings of stereo
systems will probably not be useful
because of nonoptimal implementations and
differences in hardware. Comparisons
based on computational complexity can be
made, however. How does the time
required for stereo compilation scale
with the image size, with the range of
disparity, and with other important
parameters? How amenable to hardware
implementation are the different methods?
What efficiency is needed for useful
automated stereo systems?

(7) Human engineering - how are the results
displayed (perspective 3D plots, false
coloring, contour plots, vector fields,
etc.)? What are the best methods? Is
human interaction allowed?

(8)  Sources  of  data  for  experimental 
validation 

(a)  Synthetic  images  or  scaled  models 
(model  boards) 

*  Advantages: 

Cheap 

Certainty  about  actual 
depths 

Control  over  secondary 
parameters 

*  Disadvantages: 

Not  as  complex  as  real- 
world  scenes 

Not  representative  of  any 
real  image  domain 

(b)  Ground  surveys. 

*  Advantages: 

Realistic 

Certainty  about  actual 
depths 

*  Disadvantages: 

Expensive  (hence  limited 
number  of  sites  that 
can  be  surveyed) 

(c)  Compare  to  human  performance. 

* Advantages:

Realistic

Reasonably inexpensive

*  Disadvantages: 

Susceptible  to  human  errors 

IV SURVEY


This survey covers a representative sampling of
the image understanding work relevant to
computational stereo. While not exhaustively
covering the field, it does contain examples of all
the significantly different approaches to the steps
in the computational stereo paradigm. The work
discussed in the survey is roughly grouped
according to the research centers where the primary
investigators are resident, although exceptions
will be found. Other ways of organizing the survey
were considered, but none was entirely satisfactory.


Control  Data  Corporation 

A flexible approach to digital stereo mapping, [14]

This  work  is  concerned  with  the  automation  of 
stereo-mapping  functions.  The  primary  concerns 
have  been  with  handling  different  kinds  of  terrain 
and sensors, efficient hardware implementation, and
the  development  of  an  interactive  mapping  system. 

A regularly spaced grid of points in the left
image is matched in the right image. Matching is
done by searching along the corresponding epipolar
line in the right image for a maximum correlation
patch, which is warped to account for predicted
terrain relief (estimated from previous matches).
Sub-pixel matches are obtained by fitting a
quadratic to the correlation coefficients and
picking the interpolated maximum.
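
The sub-pixel step described above can be sketched as follows. This is only
an illustration under stated assumptions, not the Control Data
implementation: the function name subpixel_peak is hypothetical, the
correlation coefficients are assumed to be sampled at unit spacing along the
search segment, and the quadratic is fitted through the best sample and its
two neighbours.

```python
def subpixel_peak(scores):
    """Return the interpolated position of the maximum of a sampled
    correlation curve (positions are in units of the sample spacing)."""
    # Index of the largest sampled correlation value.
    i = max(range(len(scores)), key=lambda k: scores[k])
    # At either end of the search segment one neighbour is missing,
    # so fall back to the integer position.
    if i == 0 or i == len(scores) - 1:
        return float(i)
    left, centre, right = scores[i - 1], scores[i], scores[i + 1]
    denom = left - 2.0 * centre + right
    if denom == 0.0:
        # Flat top: the three samples are equal, no refinement possible.
        return float(i)
    # Vertex of the parabola through the three samples.
    return i + 0.5 * (left - right) / denom


# Samples of the parabola 1 - (x - 2.3)^2 at x = 0..4.
scores = [-4.29, -0.69, 0.91, 0.51, -1.89]
peak = subpixel_peak(scores)
```

For samples that lie exactly on a parabola, as in the usage lines, the vertex
is recovered exactly (here at 2.3); for real correlation curves the result is
only an estimate.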

"Tuning  parameters"  may  be  dynamically  altered 
to  adapt  the  system  to  sensor  and  terrain 
variations. Some tuning parameters are grid limits
and interval sizes; patch size and shape; number of
correlation sites along the search segment;
prediction function weighting coefficients; and
reliability thresholds for the correlation
coefficient, standard deviation, prediction
function range, and slope of the correlation
function. The intent is to choose the smallest
feasible patch, subject to the need to compensate
for noise and lack of intensity variation in the
image.

A  continuity  constraint  is  used  to  predict 
matches.  The  rate  of  change  of  disparity  is 
assumed  to  be  continuous.  This  constraint  is  also 
used  to  shape  the  correlation  patches  in  the  left 
image (using a linear interpolation and bilinear
resampling).

The  reliability  of  matching  is  continuously 
monitored  to  signal  when  parameters  become 
inappropriate or when the photometry prevents valid
matching.  Reliability  is  estimated  with  a 
combination  of  correlation  coefficient,  patch 
standard  deviation  (are  features  present?), 
distance  of  maximum  from  predicted  point, 
prediction  function  limits,  and  slope  of  the 
correlation  function. 

The system is implemented on a highly parallel
configuration of 4 CDC Flexible Processors, each
capable of 8 MIPS.

A somewhat different approach has been taken
for three-dimensional modeling of cultural sites
(e.g., building complexes) from high-resolution
images. The basic idea is to identify
corresponding points of intersection between
epipolar lines and edges in the two images of a
stereo pair. Non-matched edges are assumed to be
due to noise or occlusions. Depth along an
epipolar line (corresponding to a three-dimensional
profile line in the scene) is assumed to vary
linearly between contiguous pairs of matched
intersections. Special techniques are developed to
deal with occlusions and "reversals." Edge
tracking across sequential epipolar lines (the
continuity constraint) contributes to reliability.


Lockheed 

Bootstrap  stereo,  [15] 

The goal of this study is navigation of an
autonomous aerial vehicle from passively sensed
images by a method called "bootstrap stereo."
Ground  control  points  are  used  to  determine  the 
vehicle's  location  and  a  camera  model  is  used  to 
locate  further  correspondences.  Major  components 
of  the  system  are  camera  calibration,  new  landmark 
selection,  matching,  and  control  point  positioning. 
The  complete  system  will  consist  of  several 
navigation  "specialists,"  including  ones  using 
instrumentation  (altimeter,  airspeed  indicator, 
attitude  gyros),  dead  reckoning,  landmarks,  and 
stereo. 

Camera  calibration  is  achieved  with  standard 
least-squares  methods  to  determine  position  and 
orientation  of  the  camera. 

New  landmark  selection  involves  an  adaptation 
of  the  Moravec  operator  that  uses  ratios  of 
directed  variance  along  two  orthogonal  directions 
(instead of simply the directed variance in four
directions). 

Point  matching  is  accomplished  with  normalized 
cross-correlation  using  a  spiraling  grid  search. 
Coarse  matching  is  used  to  approximately  register 
the  images  and  to  initialize  second-order 
prediction  polynomials.  Autocorrelation 
thresholding  is  used  to  evaluate  the  reliability  of 
matches (good matches have sharply peaked
autocorrelation functions). Subpixel matching
accuracy  is  achieved  through  parabolic 
interpolation  of  the  correlation  values. 
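
As an illustration of the similarity measure named above, the following is a
minimal sketch of normalized cross-correlation between two equal-sized
patches. The function name ncc and the flat-list patch representation are
assumptions made for this example; the spiraling grid search, the prediction
polynomials, and the autocorrelation test are not shown.

```python
import math


def ncc(a, b):
    """Normalized cross-correlation of two equal-sized patches, each
    given as a flat list of intensities. The result lies in [-1, 1]."""
    n = len(a)
    mean_a = sum(a) / n
    mean_b = sum(b) / n
    # Remove the mean so that a constant brightness offset has no effect.
    da = [x - mean_a for x in a]
    db = [y - mean_b for y in b]
    num = sum(x * y for x, y in zip(da, db))
    # Dividing by the product of the patch norms removes gain differences.
    den = math.sqrt(sum(x * x for x in da) * sum(y * y for y in db))
    # A featureless (constant) patch has no defined correlation; report 0.
    return num / den if den > 0 else 0.0


same_shape = ncc([1, 2, 3, 4], [3, 5, 7, 9])   # gain and offset differ
reversed_shape = ncc([1, 2, 3, 4], [4, 3, 2, 1])
```

Because the mean and the norm are removed, two patches that differ only by a
brightness gain and offset correlate perfectly, which is the usual reason for
preferring the normalized form.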

Control  point  positioning  involves  determining 
the  depth  of  matched  points.  It  is  done  with 
straightforward  triangulation. 
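
For the simplest camera geometry, triangulation reduces to a single division.
The sketch below assumes two identical cameras with parallel optical axes
separated by a known baseline, which is a simplification and not necessarily
the camera model used in the bootstrap stereo system; the function name and
the units are hypothetical.

```python
def depth_from_disparity(focal_length_px, baseline, disparity_px):
    """Depth of a matched point for a parallel-axis stereo pair.

    focal_length_px and disparity_px are in pixels; the depth is
    returned in the same units as the baseline."""
    if disparity_px <= 0:
        raise ValueError("disparity must be positive for a point in front of the cameras")
    # Similar triangles: depth / baseline = focal length / disparity.
    return focal_length_px * baseline / disparity_px


depth = depth_from_disparity(1000.0, 0.5, 10.0)
```

Under this model a fixed disparity error produces a depth error that shrinks
as the baseline grows, which is why longer baselines give more accurate
depths.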


Stanford 

(1)  Stereo-camera  calibration,  [4] 

A  method  for  determining  the  relative  position 
and  orientation  of  two  cameras  from  a  set  of 
matching  points  is  developed.  The  calibration 
accounts  for  difference  in  azimuth,  elevation,  pan, 
tilt, roll, and focal length. The basic method is
a least-squares minimization of the errors of the
distances of points in image 2 from their predicted
locations, as determined by their positions in
image 1 and an estimated camera model. The
nonlinear optimization problem is solved by
iterating on a linearization of the problem.

(2) Local context in matching edges for stereo
vision, [9]

This  approach  matches  corresponding  features 
instead  of  matching  areas  using  cross  correlation. 
Features are edge elements produced by a Hueckel
operator. The approach uses a continuity
constraint to resolve ambiguity. If a scene is
continuous in three dimensions then adjacent
matching  edge  elements  should  be  continuous  in 
direction  and  disparity.  Intensities  on  either 
side  of  the  edge  should  also  be  consistent. 

The Moravec operator is used to select about
50 points. A coarse/fine search finds matches for
some of these points, and Gennery's camera model
solver is used to determine the parameters that
relate the two camera positions. A ground plane is
fit to the matches (few points may lie below the
ground plane, some may be above it, and as many as
possible lie on it). A Hueckel operator is applied
to both images (3.19 pixel radius), and the results
are transformed into a normalized coordinate system
such that the stereo axis is in the x direction and
points on the ground plane have zero disparity.

Each edge element in the left picture is
matched to nearby candidates in the right image
(there are usually about eight candidates) based on
the angle and brightness information supplied by
the Hueckel operator. Each edge in the left image
is then linked to all its neighbors that seem to
arise from the same physical edge. (Two edges are
neighbors if they are close, have roughly the same
angle, and similar brightness. Three or four are
typically found.) The linked neighbors of an edge
element vote to determine which of the candidate
disparities is most consistent.

Some  problems  caused  by  the  Hueckel  operator 
are  identified  (for  example,  it  is  unreliable  for 
corners,  textured  areas,  and  slow  gradients).  The 
author suggests relaxation as a way to use context
in a more controlled way (see [7]). The system
works  well  in  scenes  of  man-made  objects,  but 
poorly  in  natural  scenes  (the  opposite  of  area 
correlation). 

(3) Object detection and measurement using stereo
vision, [12]

This  study  uses  stereo  or  rangefinder  data  to 
detect  and  measure  objects,  and  although  it  does 
not  deal  with  the  matching  problem,  it  is  relevant 
to  the  interpolation  and  interpretation  problems. 
The  system  is  intended  for  autonomous  vehicle 
guidance  and  obstacle  avoidance. 

First, the ground surface is found as
described by Arnold in [9]. Above-ground points
are clustered and ellipsoids are fit. Clustering
is done with a minimal spanning tree approach. The
author suggests the use of relaxation for
clustering. Next, the ellipsoids are adjusted to a
better fit with a modified least-squares method.
Two types of errors are considered: the amount by
which the points in a cluster being fit miss lying
on the ellipsoid, and the amount by which the
ellipsoid occludes any points as seen from the
camera. (Orthographic projection, not central
projection, is assumed.) In addition, there is an
a priori bias to make any small ellipsoids
approximately spherical.

After  ellipsoids  have  been  fit  to  the  original 
clusters, it may become apparent that the initial
clustering,  based  on  only  local  information,  did 
not  produce  a  good  segmentation.  In  this  case,  the 
initial  clusters  are  either  split  or  merged  and 
another  set  of  ellipsoids  is  fit  to  them. 

Although  this  work  does  not  address  the 
central  problems  of  computational  stereo,  it  is  an 
interesting way of both smoothing and interpreting
raw  depth  information  made  available  from  stereo. 
The  ellipsoid  model  is  plausible  for  moon  rocks, 
but  probably  not  for  most  other  objects. 

(4)  Visual  mapping  by  a  robot  rover,  [6] 

This  is  a  study  of  autonomous  vehicle 
guidance.  Severe  noise  problems  are  overcome  by 
use  of  redundancy.  An  early  approach  that  used 
only  motion  stereo  was  found  to  be  unworkable 
because  of  matching  errors  and  uncertain  camera 
models. The latest approach uses "slider stereo"
to  obtain  nine  stereo  views  at  6.4  cm  intervals.  A 
calibration  step  determines  the  camera's  focal 
length  and  distortion  from  a  digitized  test 
pattern. 

An interest operator is used to select good
features for matching. First, for each point it
computes the variance between adjacent pixels in
four directions over a square (3x3 pixel)
neighborhood; next, it selects the minimum variance
as its interest measure; and finally, it chooses
feature points where the interest measure is
locally maximal. Intuitively, each chosen point
must have relatively high variance in several
directions, and must be more "interesting" than its
immediate neighbors. The interest operator is used
on reduced versions of the images.
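
The three steps of the interest operator can be sketched as below. This is
a plain-Python illustration, not Moravec's code: the function names are
hypothetical, "variance between adjacent pixels" is taken to mean the sum of
squared differences between neighbouring pixels along each direction, and
the image is assumed to be a list of rows of numbers.

```python
def moravec_interest(img, half=1):
    """Interest measure at each pixel: the minimum, over four directions,
    of the sum of squared differences between adjacent pixels inside a
    (2*half+1) x (2*half+1) window. Border pixels are left at zero."""
    h, w = len(img), len(img[0])
    # Horizontal, vertical, and the two diagonal directions.
    directions = [(0, 1), (1, 0), (1, 1), (1, -1)]
    out = [[0.0] * w for _ in range(h)]
    margin = half + 1  # keeps every neighbour lookup inside the image
    for r in range(margin, h - margin):
        for c in range(margin, w - margin):
            sums = []
            for dr, dc in directions:
                s = 0.0
                for y in range(r - half, r + half + 1):
                    for x in range(c - half, c + half + 1):
                        d = img[y][x] - img[y + dr][x + dc]
                        s += d * d
                sums.append(s)
            # A point on a straight edge has zero variance along the
            # edge, so taking the minimum suppresses it.
            out[r][c] = min(sums)
    return out


def local_maxima(interest, threshold=0.0):
    """Pixels whose interest exceeds the threshold and is not exceeded
    by any of the eight immediate neighbours."""
    h, w = len(interest), len(interest[0])
    points = []
    for r in range(1, h - 1):
        for c in range(1, w - 1):
            v = interest[r][c]
            if v <= threshold:
                continue
            neighbours = [interest[r + dr][c + dc]
                          for dr in (-1, 0, 1) for dc in (-1, 0, 1)
                          if (dr, dc) != (0, 0)]
            if v >= max(neighbours):
                points.append((r, c))
    return points


# A bright square occupying the lower-right quadrant of an 8x8 image.
corner = [[10 if r >= 4 and c >= 4 else 0 for c in range(8)] for r in range(8)]
features = local_maxima(moravec_interest(corner))
```

On the synthetic image the only feature selected is the corner of the square;
points along its straight sides score zero because the variance along the
edge direction vanishes.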

A binary search correlator matches 6x6 areas
denoted by features in one image to areas in
another image. The search begins at the lowest
resolution (x16 reduction) and proceeds to the
higher resolutions. In this way, points chosen
from the center view are found in the other eight
views. The uncertainty of the depth measurement
associated with a match is inversely proportional
to the length of the stereo baseline. To obtain
more accurate depths, the measurements are averaged
by considering each of the stereo baselines
obtained from the thirty-six combinations of nine
views taken two at a time. A measurement from a
particular pair contributes a normal distribution
with a mean at the estimated distance and a
standard deviation inversely proportional to the
stereo baseline. The curves are also normalized
according to the correlation coefficients of the
matches (a low coefficient reduces the area) and
according to the degree of y-disparity (a large
y-disparity reduces the area). The peak in the sum
of these distributions gives a very reliable depth
measurement.
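
The combination of measurements described above might be sketched as
follows. The sketch is illustrative only: the function name fuse_depths, the
proportionality constant relating standard deviation to baseline (taken as
1), the single weight standing in for both the correlation and y-disparity
normalizations, and the grid search for the peak are all assumptions.

```python
import math


def fuse_depths(measurements, lo, hi, steps=1000):
    """Return the depth at which the weighted sum of normal
    distributions peaks, searched on a uniform grid over [lo, hi].

    measurements is a list of (depth, baseline, weight) triples; the
    weight scales the area under that measurement's curve."""
    best_z, best_score = lo, -1.0
    for i in range(steps + 1):
        z = lo + (hi - lo) * i / steps
        score = 0.0
        for depth, baseline, weight in measurements:
            # Standard deviation inversely proportional to the baseline
            # (proportionality constant assumed to be 1).
            sigma = 1.0 / baseline
            g = math.exp(-0.5 * ((z - depth) / sigma) ** 2)
            g /= sigma * math.sqrt(2.0 * math.pi)
            score += weight * g
        if score > best_score:
            best_z, best_score = z, score
    return best_z


# Three consistent measurements and one poorly correlated outlier.
pairs = [(5.0, 2.0, 0.9), (5.1, 4.0, 0.8), (4.9, 1.0, 0.9), (9.0, 1.0, 0.3)]
fused = fuse_depths(pairs, 0.0, 10.0)
```

Taking the peak of the summed curves, rather than a weighted mean, means an
isolated bad match forms its own small bump and does not pull the estimate
away from the consensus.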

Depth measurements are used to drive the
vehicle in about one-meter increments. Vehicle
motion is deduced from depth measurements at two
positions by comparing the differences of point
positions, which should be the same in both views.
Each point is modeled as a sphere whose size
depends on the uncertainty of the point's position.
The vehicle is modeled as a three-meter sphere.
From these, a near optimal path is determined to a
goal point.


MIT 

(1) Cooperative computation of stereo
disparity, [16]

A parallel, "cooperative" computation model
for human stereo vision is proposed. This feature
matching method uses two constraints to match
random-dot stereograms. The features that are
matched are the dots themselves. The constraints
are: uniqueness, which requires that every feature
have a unique disparity (a consequence of imaged
points on three-dimensional surfaces having unique
depths); and continuity, which requires that
disparity varies smoothly almost everywhere (except
at relatively rare occlusion boundaries). These
constraints are applied locally over several
iterations with an algorithm very much like
relaxation-labeling. Multiple disparity
assignments of points inhibit one another, and local
collections of similar disparities support one
another. Although this algorithm successfully
fused random-dot stereograms, the authors rejected
it as a model of human stereopsis and proposed a
new model described below.

(2) A computational theory of human stereo
vision, [2]

A computer implementation of a theory
of human stereo vision, [17]

Aspects of a theory of human stereo
vision, [1]

Matching of features occurs in four
independent channels tuned to different spatial
frequencies. The matches found in the lower
frequency channels establish a rough correspondence
for the higher frequency channels, thereby reducing
the number of false matches.

In the original theory, the features that were
proposed were zero-crossings of an image first
low-pass filtered and then convolved with bar masks of
four different sizes and different orientations,
with a cross section that was a difference of
Gaussian functions with space constants in the
ratio of 1:1.75. The zero-crossings after a second
difference operation correspond to extrema after a
first difference operation. This method is
therefore a way of finding edges at different
scales. In the implementation of the theory bar
masks were not used; instead, circularly symmetric
differences of Gaussians were used to approximate
the Laplacian of a Gaussian distribution. The
convolutions were done on a LISP machine and
special-purpose hardware. In the original theory
line terminations were to be used as features,
along with zero-crossings, but this has not been
implemented.
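
The relation between a difference-of-Gaussians filter and edge location can
be illustrated in one dimension. The sketch below is an assumption-laden
simplification, not the MIT implementation: it works on a single scan line
rather than a two-dimensional image, uses the 1:1.75 ratio quoted above,
truncates the mask at four times the larger space constant, and all function
names are hypothetical.

```python
import math


def dog_kernel(sigma, ratio=1.75):
    """Sampled difference of two unit-area Gaussians whose space
    constants are sigma and ratio * sigma."""
    radius = int(math.ceil(4 * sigma * ratio))

    def gauss(x, s):
        return math.exp(-0.5 * (x / s) ** 2) / (s * math.sqrt(2.0 * math.pi))

    return [gauss(x, sigma) - gauss(x, sigma * ratio)
            for x in range(-radius, radius + 1)]


def convolve_valid(signal, kernel):
    """Convolution restricted to positions where the mask fits entirely
    inside the signal; output index i is centred on signal index
    i + len(kernel) // 2."""
    k = len(kernel)
    return [sum(signal[i + j] * kernel[k - 1 - j] for j in range(k))
            for i in range(len(signal) - k + 1)]


def zero_crossings(values):
    """Return (index, sign) pairs where the response changes sign
    between index and index + 1; sign is +1 for a crossing from
    negative to positive and -1 for the reverse."""
    crossings = []
    for i in range(len(values) - 1):
        a, b = values[i], values[i + 1]
        if a * b < 0:
            crossings.append((i, 1 if b > 0 else -1))
    return crossings


# A step edge between samples 29 and 30 of a 60-sample scan line.
signal = [0.0] * 30 + [1.0] * 30
kernel = dog_kernel(1.0)
radius = len(kernel) // 2
response = convolve_valid(signal, kernel)
edges = [(i + radius, sign) for i, sign in zero_crossings(response)]
```

The single sign change falls between samples 29 and 30, where the step lies,
and its sign records the contrast direction, which is one of the attributes
later used to restrict matches.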

Zero-crossings where the gradient is oriented
vertically are ignored (the implicit camera model
has the epipolar rays oriented horizontally).
Other zero-crossings are located to an accuracy of
one pixel and their orientations (determined by the
gradient of the convolution values) are recorded in
increments of 30 degrees.

Matching within any given channel proceeds
independently of other channels. First, the "eye
position" is fixed and a zero-crossing is located
in one image. (The eye position is effectively a
rigid translation of the two images with respect to
one another, and defines a continuous mapping of
points in one image to points in the other.) The
region surrounding the corresponding point in the
second image is then divided into three pools:
two larger convergent and divergent regions
(towards and away from the "nose", respectively)
and a smaller null-vergence region centered on the
predicted match location. The pools together span
a region of twice the width of the central positive
region of the convolution mask. Zero-crossings from
pools in the second image can match the one from
the first image only if they result from
convolutions of the same size mask, have the same
sign, and have approximately the same orientation.
If a unique match is found (i.e., only one of the
pools has a zero-crossing satisfying the above
criteria), the match is accepted as valid. If two
or three candidate matches are found, they are
saved for future disambiguation. Once all matches
have been found (ambiguous or not), the ambiguous
ones are resolved by searching through the
neighborhoods of points to determine the dominant
disparity (convergent, divergent, or null). This
is the familiar continuity constraint.

It may be the case that the disparity of a
region is greater than the range handled by the
matcher. This is detected from the percentage of
unmatched zero-crossings. Marr and Poggio showed
that the probability of a zero-crossing having at
least one candidate match in this situation is
about 0.7. If the disparity is within the range of
the matcher, however, the probability is much
higher.

The lower frequency matching channels are used
to guide the "eyes" to bring the higher frequency
channels into range. The possibility of using
other sources of information to guide the eye
movement (in particular, texture contours) was
mentioned by Grimson [1].


SRI International

(1) Parametric correspondence and chamfer
matching: two new techniques for image
matching, [18]

A method for matching images to a
three-dimensional symbolic reference map is presented.
The reference map includes point landmarks,
represented with three-dimensional coordinates;
linear landmarks, represented as curve fragments
with lists of three-dimensional coordinates; and
volumetric structures, represented as wire-frame
models. A predicted image is generated from an
expected viewpoint by projecting three-dimensional
coordinates onto image coordinates, suppressing
hidden lines. The predicted image is matched to
image features, and the error is used to adjust the
viewpoint approximation. The matching is done by
"chamfering." The image feature array is first
transformed into an array of numbers representing
the distance to the nearest feature point. The
similarity measure is then computed by summing the
distance array values at the predicted feature
locations.
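
The two stages of chamfering described above, the distance array and the
summed similarity measure, can be sketched as follows. This is an
illustration under assumptions rather than the SRI code: the function names
are hypothetical, and the step costs of 3 for a horizontal or vertical move
and 4 for a diagonal move are an assumed integer approximation to Euclidean
distance.

```python
def chamfer_distance(features):
    """Two-pass distance transform of a binary feature grid
    (nonzero = feature point). Each cell receives an approximate
    distance to the nearest feature, in units where one horizontal or
    vertical step costs 3 and one diagonal step costs 4."""
    h, w = len(features), len(features[0])
    inf = 10 ** 9
    d = [[0 if features[r][c] else inf for c in range(w)] for r in range(h)]
    # Forward pass: propagate distances from the upper-left neighbours.
    for r in range(h):
        for c in range(w):
            if c > 0:
                d[r][c] = min(d[r][c], d[r][c - 1] + 3)
            if r > 0:
                d[r][c] = min(d[r][c], d[r - 1][c] + 3)
                if c > 0:
                    d[r][c] = min(d[r][c], d[r - 1][c - 1] + 4)
                if c < w - 1:
                    d[r][c] = min(d[r][c], d[r - 1][c + 1] + 4)
    # Backward pass: propagate distances from the lower-right neighbours.
    for r in range(h - 1, -1, -1):
        for c in range(w - 1, -1, -1):
            if c < w - 1:
                d[r][c] = min(d[r][c], d[r][c + 1] + 3)
            if r < h - 1:
                d[r][c] = min(d[r][c], d[r + 1][c] + 3)
                if c < w - 1:
                    d[r][c] = min(d[r][c], d[r + 1][c + 1] + 4)
                if c > 0:
                    d[r][c] = min(d[r][c], d[r + 1][c - 1] + 4)
    return d


def chamfer_score(dist, predicted):
    """Sum of distance-array values at the predicted (row, col) feature
    locations; smaller values mean better agreement."""
    return sum(dist[r][c] for r, c in predicted)


# Image features: a vertical line in column 2 of a 5x5 grid.
grid = [[1 if c == 2 else 0 for c in range(5)] for r in range(5)]
dist = chamfer_distance(grid)
aligned = chamfer_score(dist, [(0, 2), (1, 2), (2, 2)])
shifted = chamfer_score(dist, [(0, 3), (1, 3), (2, 3)])
```

Because the distance array is computed once per image, each trial viewpoint
costs only one table lookup per predicted feature point, and the score
degrades smoothly as the prediction drifts away from the image features.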

(2) The SRI road expert: image-to-database
correspondence, [19]

The problem of matching an image to a database
is studied. The images may vary for several
reasons: different camera parameters, lighting
conditions, cloud cover, etc. The method that is
presented begins with an estimate of the camera
parameters, including estimates of uncertainties.
It refines the estimated correspondence by locating
landmarks in the image and comparing their image
locations to their predicted locations. The
uncertainties of the camera parameter estimates are
modeled as a joint normal distribution. This model
implies elliptical uncertainty regions in the
image. The location of one feature constrains the
uncertainty of others to relative uncertainty
regions (these are also ellipses, but are usually
significantly smaller than the unconstrained
regions). Two kinds of matches between landmarks
and image features are used: point-to-point and
point-on-a-line. The point-to-point matches yield
more information for refining the camera
parameters, but the point-on-a-line matches are
more numerous and cheaper to find. A modified
version of Gennery's calibration method [4] is used
to refine the camera parameters.

(3) Random sample consensus: a paradigm for model
fitting with applications to image analysis
and automated cartography, [5]

A method for fitting a model to experimental
data is developed (RANSAC) and applied to the
"location determination problem" (i.e., given a set
of control points with known positions in some
coordinate frame, determine the locations from
which an image of the control points was obtained).
The method is radically different from conventional
methods, such as least squares, which begin with
large amounts of data and then attempt to eliminate
invalid points. RANSAC uses a small, randomly
chosen set of points and then enlarges this set
with consistent data when possible. This strategy
avoids a common problem with least squares and
similar methods: a few gross errors, or even a
single one, can lead to very bad solutions. In
practice, RANSAC can be used as a method for
selecting and verifying a set of points that can be
confidently fit to a model with a conventional
method.
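
The strategy can be illustrated on a much simpler model than the location
determination problem. The sketch below fits a straight line y = a*x + b and
is an assumption throughout: the function names, the tolerance, the iteration
count, and the fixed random seed are hypothetical choices made for the
example.

```python
import random


def fit_line_least_squares(points):
    """Conventional least-squares fit of y = a*x + b."""
    n = len(points)
    sx = sum(x for x, _ in points)
    sy = sum(y for _, y in points)
    sxx = sum(x * x for x, _ in points)
    sxy = sum(x * y for x, y in points)
    a = (n * sxy - sx * sy) / (n * sxx - sx * sx)
    b = (sy - a * sx) / n
    return a, b


def ransac_line(points, tol=0.5, iterations=200, seed=0):
    """Return ((a, b), consensus_set) for the line supported by the
    largest set of points lying within tol of it."""
    rng = random.Random(seed)
    best = []
    for _ in range(iterations):
        # Start from the smallest set that determines the model.
        (x1, y1), (x2, y2) = rng.sample(points, 2)
        if x1 == x2:
            continue  # a vertical line cannot be written as y = a*x + b
        a = (y2 - y1) / (x2 - x1)
        b = y1 - a * x1
        # Enlarge the set with every point consistent with the trial line.
        consensus = [(x, y) for x, y in points if abs(y - (a * x + b)) <= tol]
        if len(consensus) > len(best):
            best = consensus
    if len(best) < 2:
        raise ValueError("no consensus set found")
    # Final fit by a conventional method on the verified points only.
    return fit_line_least_squares(best), best


inliers = [(x, 2 * x + 1) for x in range(10)]
gross_errors = [(2, 30), (5, -20), (8, 40)]
(a, b), support = ransac_line(inliers + gross_errors)
naive_a, naive_b = fit_line_least_squares(inliers + gross_errors)
```

Comparing (a, b) with (naive_a, naive_b) shows the point made in the text:
the three gross errors distort the plain least-squares fit, while the
consensus set excludes them before the final fit is made.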

(4) Disparity analysis of images, [7]

The  Image  Correspondence  Problem,  [8] 

Points are matched in two images that differ
because of normal stereo, camera motion, or object
motion. The Moravec operator is used to select
point features in both images. An initial
collection of possible matches is established by
linking each point in the first image with possible
matching points in the second image. (A point in
the second image is considered a possible match if
it is in a square area centered on the position of
the point in the first image.) Each point from the
first image is considered an object that is to be
classified according to its disparity, and each of
its possible matches establishes a label denoting
one of several possible classifications. Each
object also has a special label denoting
"no-match." An initial confidence for each disparity
label is determined based on the
mean-square-difference of small regions surrounding the
possible matching points. The estimates are
iteratively improved with a relaxation-labeling
algorithm that uses the continuity constraint.
Support for each label of a particular object is
calculated from the neighboring objects. If
relatively many nearby objects have similar labels
with high confidence, the label is strongly
supported and its confidence increases. If no
labels are strongly supported, the confidence of
the "no-match" label increases. After a few
iterations (about 8) the confidence estimates
converge to unique disparity classifications for
each point. (Convergence is not guaranteed
theoretically, but is observed experimentally.)
Abstract


A simple analytical procedure is introduced for utilizing a
ubiquitous engineering and architectural structural subelement to
facilitate automatically cuing, monoscopically inferring surface
structure and orientation, and resolving stereo correspondences:
orthogonal trihedral vertices, or OTVs. OTVs occur in profusion
indoors and out. They are identifiable, and are a rich source of
information regarding relative surface conformation and
orientation. Practical considerations often constrain OTVs to be
vertically aligned. General oblique perspective properties of OTVs
are examined. The especially important case of nadir-viewing
aerial stereophotogrammetry is developed in detail. An
object-space vertex labeling convention incorporates vertex type and
orientation. A set of image space junction signature rules based
upon the object space invariance of OTV edge vanishing points
enables unambiguous vertex label assignment for interior and
exterior OTVs. An independent application of the labeling scheme
to both members of a stereo pair, taken at arbitrarily wide
convergence angle, identically labels corresponding junctions. An
illustrative example is presented. Algorithmic implementation has
not yet been undertaken.


1.  Introduction 


A basic goal of automated image interpretation is
achievement of a symbolic description of the imaged scene. The
relatively limited potential for extracting quantitative geometric
descriptions from single images is vastly expanded for stereo
images. Whereas the present work was motivated by problems
in stereo processing, it will be seen to have applicability to single
image interpretation as well.

It has been the traditional goal of automated stereo
processing systems to attain an accurate and detailed depth map. The
central stereo problem is that of establishing correspondences
between points in stereo image pairs associated with common
object space points in the original scene. Area cross-correlation
techniques [Hannah 74, Kelly 77, Panton 78] have worked
satisfactorily for smoothly undulating relief, but are unsuitable where
surface slope or range are discontinuous, in relatively
textureless regions and where interimage surface brightness disparities
are extreme. An edge-based approach [Arnold 80, Baker 80,
Grimson 80, Henderson 79, Marr 79], complementary to that of
area cross-correlation, offers advantages for complex structural
configurations.


Whereas the goal of symbolic description is generally explicit
in single image processing, it has often not been so in stereo
image processing. It is to be expected that a general purpose
truly powerful stereo image processing system should be capable
of delivering not only an accurate and complete depth map, but
also a quantitative symbolic description of the imaged scene. The
processes of a) resolving stereo correspondences, b) arriving at
geometric descriptions of ranged surfaces, c) generating volume
descriptions of imaged objects, d) inferring the geometry of
portions of structures visible in only a single member of the stereo
pair, e) cuing on special features, and f) generally describing the
content of the imaged scene all can be facilitated by utilizing
world model information relevant to the environment in question.
The development of a robust and accurate stereo ranging system
and a powerful general purpose quantitative symbolic description
capability might well go hand-in-hand.

Inference plays an important part in human visual
perception. Knowledge of the nature of our surroundings influences
perceptual processes. Consider a recent experience of the writer.
While driving out of the university campus in the dusk and in
drizzle one evening, a figure suddenly looming, forward and to
the left, was perceived as a bicyclist approaching on collision
course from a range of about 50 feet. The "bicyclist" turned
out to be a smudge on the windshield, abruptly illuminated by the
swinging head lights of a car 100 yards ahead. The apparition
was processed reflexively, and conservatively, for a fraction of a
second as a bicyclist because, presumably, a) it was superposed
on a small campus side street from which bicyclists frequently
emerge, b) it subtended roughly the proper angular size, c) being
stationary in the field of view it was "on collision course", and
d) it was a likely time for a bicyclist to be on that road. This is
a manifestation of the Hunter's Principle, namely, to "shoot at
anything that moves" [Binford 81]. The conclusion happens to
have been erroneous, but it was a conservative and proper one
under the circumstances. Though the above incident was clearly
not processed in stereo, it is an example of the human system
invoking an elaborate inferential mechanism, doing the best it
could with the context, constraints, world model information and
the limited quality input available to it.

It seems reasonable to expect that a powerful automated
system for generating symbolic descriptions of stereoscopically
as well as monoscopically imaged scenes should incorporate
capabilities analogous to those listed above. The system should
be facile at recognizing or inferring the presence of common or
expected features. We develop a case in point below.


2. Orthogonal Trihedral Vertices in General Perspective


We wish to consider an application in the practical context
of the world of engineering and architectural structures. We
are interested in contributing to a versatile wide-angle stereo
processing system that can accomplish the processes listed
earlier, namely, resolve stereo correspondences, infer surfaces and
volumes, infer structure visible in only one image, and have
the rudiments of a capability to report a symbolic description.
Two kinds of three-dimensional structural feature elements that
abound in such a world are the corners of right parallelepipeds
(RPPs) and portions of right circular cylinders. We wish here to
concentrate on the characteristics of the ubiquitous interior and
exterior corner elements of RPPs, which we refer to as orthogonal
trihedral vertices, or OTVs. Though these particular feature
elements are highly specialized, their projected images exhibit rich
quantitative structural and orientational information. We adopt
the terminology of [Waltz 72] in referring to object space
corners as vertices and their two-dimensional projections as
junctions.


Figure 1. Perspective camera imaging inclined
plane. (a) Image on pseudofilm plane. (b) View
perpendicular to camera axis and parallel to object
plane.


The images that we shall be dealing with are assumed to be
produced by central projection in a planar perspective camera.
An examination of the properties of a perspective image of an
inclined plane will lead us to useful observations regarding the
projective properties of OTVs, formed by the intersections of
orthogonal triples of planes. Figure 1 schematically illustrates a
perspective camera photographing an inclined plane. The view
in Figure 1b is orthogonal to the axis of the camera, and parallel
to the object plane. The camera is shown recording the image on
a forward, or pseudofilm plane PFP, oriented parallel to the true
film plane, and located the same distance in front of the
perspective center PC as the true film plane is behind it. The image
recorded on the pseudofilm plane is identical to that formed on
the true film plane, but for a rotation of 180 degrees about the
camera axis, which passes through the perspective center normal
to the film plane. The pseudofilm plane offers the perceptual
advantage of exhibiting its image “right side up”. Henceforth,
we shall use film plane to refer to either the pseudofilm plane or
the true film plane. The point of intersection of the camera axis
with the film plane is referred to as the principal point PP. The
distance between the perspective center and the principal point
is called the camera constant c. The point of intersection with
the film plane of a ray extended from the perspective center in
a direction normal to the object plane is designated the normal
point NP. We construct a horizon plane through the perspective
center parallel to the object plane. Its intersection with the film
plane defines the horizon line HL. Note that the product of the
distance a from the principal point to the horizon line and the
distance b from the principal point to the normal point is equal to
c². It will additionally be seen from Figure 1b that the distance
b is equal to the product Gc, where G is the magnitude of the
gradient of the height of the object plane expressed in camera
coordinates. Figure 1a illustrates the perspective image, recorded
on the film plane. The normal point lies on an extension of a
perpendicular from the horizon line through the principal point.
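
The two relations just stated can be checked numerically. The sketch below is illustrative only: it assumes the object plane is tilted by an angle theta away from the film plane, so that the gradient magnitude is G = tan(theta). The function name and the sample values are hypothetical and do not come from the paper.

```python
import math

def inclined_plane_distances(c, tilt_deg):
    """Return (a, b, G) for an object plane tilted tilt_deg from the film plane.

    a: distance from the principal point to the horizon line
    b: distance from the principal point to the normal point
    G: gradient magnitude of the plane height in camera coordinates
    Assumes 0 < tilt_deg < 90; at zero tilt the horizon line is at infinity.
    """
    theta = math.radians(tilt_deg)
    G = math.tan(theta)
    b = c * G      # the ray along the plane normal meets the film plane here
    a = c / G      # the horizon plane through PC meets the film plane here
    return a, b, G

c = 50.0
a, b, G = inclined_plane_distances(c, tilt_deg=30.0)
print(a * b, c ** 2)   # the product a*b matches the square of c
print(b, G * c)        # b matches G*c
```

The sketch also shows the limiting behaviour: as the tilt approaches zero, b shrinks toward the principal point while a grows without bound.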

The reader is cautioned that within this paper we do not
always take the care that is merited to distinguish in our
terminology between object space features and their images. This is not
entirely a trivial matter. The writer has more than once been
victim of the sloppy thought that can result from a confusion of
object space and image space constructs. It would be justified,
in a more careful writing, to take care to make the distinction,
unless it has been determined that there is no potential for either


Figure 2. Perspective view of seven RPPs. The
RPPs are not identical, but each is aligned with its
faces parallel to the corresponding faces of the other
RPPs. The three straight lines that intersect to form
the central triangle are horizon lines for the faces of
the RPPs.

3. Detection of OTVs in the General Case


Figure 2 illustrates a perspective view of seven RPPs. The
RPPs are not identical, but each is aligned with its faces parallel
to corresponding faces of each of the other RPPs. Corners
corresponding to the central corner of the central RPP are
circled. The point PP is the principal point of the film plane.
Consider for the moment the single RPP located inside the
central triangular region. Its faces are labeled U, W, and S, for
“up”, “west”, and “south”, respectively. No particular absolute
orientation is implied. The familiar directional associations are
to facilitate conceptualisation. The horizon line for the face
labeled U is designated HLUD, to indicate that it is the horizon
line for both the U face, and its parallel opposing face D, for
“down”, as well. Indeed HLUD is the horizon line for the entire
family of planes parallel to these two faces of the RPP. The
horizon lines for the other faces constitute the remaining sides of
the central triangle. The intersections of the three horizon lines
correspond to the vanishing points for the edges bounding the
faces of the RPP. Now, it is a feature of an OTV, as opposed
to the case of an arbitrary object space trihedral vertex, that
the normal to any face associated with the vertex is parallel to
the edge departing the vertex. Thus there is a coincidence of
normal points and edge vanishing points in the perspective plane.
It will therefore be appropriate for the case of RPPs to label the
vanishing points of the edges with the names of the RPP faces
whose outward normals are directed toward them. Thus, the
vanishing points for the normals to the RPP faces U, W and S
are labeled with the names of their opposing faces, namely, D, E
and N, respectively. The three faces U, W and S are equivalent
in the projection of Figure 2. Therefore, the interrelationships
involving principal point, horizon lines and vanishing points
depicted in Figure 1 apply equally to each of these faces.
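
One consequence of these relations is that, with film-plane coordinates centered on the principal point, any two vanishing points v_i and v_j of an OTV satisfy v_i . v_j = -c², and the principal point is the orthocenter of the vanishing-point triangle. The following sketch illustrates this under stated assumptions: the three edge directions are a hypothetical orthogonal triple chosen for convenience, the camera axis is taken as the z axis, and the function names are illustrative rather than anything defined in the paper.

```python
import numpy as np

def vanishing_points(edge_dirs, c):
    """Vanishing points on the film plane, in coordinates centered on PP.

    edge_dirs: rows are edge directions (x, y, z) in camera coordinates,
    with z along the camera axis (each z component must be nonzero).
    """
    d = np.asarray(edge_dirs, dtype=float)
    return c * d[:, :2] / d[:, 2:3]

def is_acceptable_triple(vps, c, rel_tol=1e-6):
    """Necessary condition for an OTV: v_i . v_j = -c^2 for every pair."""
    v = np.asarray(vps, dtype=float)
    pairs = [(0, 1), (0, 2), (1, 2)]
    return all(abs(v[i] @ v[j] + c * c) <= rel_tol * c * c for i, j in pairs)

# A hypothetical mutually orthogonal triple of edge directions.
edges = np.array([[1, 2, 2], [2, 1, -2], [2, -2, 1]]) / 3.0
c = 1.0
vps = vanishing_points(edges, c)
print(vps)
print(is_acceptable_triple(vps, c))
# Orthocenter check: the line from PP to one vanishing point is
# perpendicular to the horizon line joining the other two.
print((vps[0] - vps[1]) @ vps[2])
```

A triple of candidate vanishing points that fails this test cannot arise from an orthogonal trihedral vertex for the assumed principal point and camera constant.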


The three horizon planes associated with the faces of the
RPP divide the forward object space into seven zones. The film
plane is correspondingly divided into seven sectors by the three
intersecting horizon lines. An RPP has been placed in each of the
zones. It will be noted that when an RPP face is translated across
its respective horizon, the reverse side of the face (geometrically,
if not physically) becomes potentially visible. If, for example,
an RPP is translated from beneath the HLUD horizon plane to
entirely above it, then the D face becomes visible. An inspection
of Figure 2 reveals that all faces of a generally oriented RPP may
be made visible by translational displacement, without rotation.
It should be noted in passing, however, that for the case of
symmetrical angular alignment, where each of the OTV edges
is equally inclined relative to the camera axis, these edges will
be inclined at an angle of approximately 54.7 degrees to the
forward projection of the axis. Any departure from symmetrical
alignment will force at least one of the vanishing points to move
to an even greater inclination to the axis. Thus, a camera with
a 110 degree field of view would be required in order to image
even the tightest configuration of all three vanishing points. A
90 degree field of view would be required to capture even two
of the vanishing points under the most favorable conditions of
alignment, corresponding to the third vanishing point being at
infinity. A conventional camera cannot therefore be expected to
capture more than a single vanishing point in any given image.
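
The angles quoted above follow from elementary geometry and can be reproduced in a few lines. This is a small illustrative calculation, assuming the symmetric case in which the camera axis lies along the body diagonal of the OTV; the variable names are hypothetical.

```python
import math

# In the symmetric configuration each unit edge direction has the same
# component along the camera axis, namely 1/sqrt(3).
edge_to_axis_deg = math.degrees(math.acos(1.0 / math.sqrt(3.0)))

# Full field of view needed to contain all three vanishing points.
fov_three_points_deg = 2.0 * edge_to_axis_deg

# Two orthogonal edge directions always subtend 90 degrees at the
# perspective center, so this is the least field of view for two points.
fov_two_points_deg = 90.0

print(round(edge_to_axis_deg, 1))
print(round(fov_three_points_deg, 1))
```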


We have already suggested that an OTV is an example of
an object that generally presents a challenging wide-angle stereo
correspondence problem since the projection into the two images
can be quite disparate. The detection of OTVs in a single image
is not a totally trivial matter either [Roberts 63], even as a
geometrical problem, disregarding the realities of dealing with
digitized images of real scenes, wherein one must contend with
issues of resolution, noise, shadows, and the like. The problem of
OTV detection for the case of general oblique perspective
projection has not yet been pursued in detail by the writer. We shall
shortly be imposing practical constraints that will simplify the
problem for a large and important class of cases. Nevertheless,
we reflect briefly on the general case. A necessary condition for a
three-legged image junction to correspond to an OTV is that the
projections of the legs pass through an acceptable triple of
vanishing points, appropriately arrayed about the principal point, per
the conditions discussed in conjunction with Figures 1 and 2.
Though the condition is not sufficient, it is expected to be highly
indicative. There are constraint conditions on the values that can
be assumed by the projected internal angles at triple junctions
corresponding to OTVs. Though the writer has not yet pursued
the point, it is speculated that these are limiting conditions
corresponding to those that must be satisfied by the vanishing point
relations. When there is coherence of alignment of object space
features over a relatively wide camera window, then vanishing
point locations can be inferred from single or concurrent
intersections of leg extensions, and the vanishing point configuration
checked. If it is known in advance or presumed that one is
dealing with RPPs, then the junction location and the directions of
the film plane projections of the three legs are sufficient to yield
the orientation of the OTV relative to the camera, though of
course not the range. Stereo correspondence and triangulation
could establish the latter.
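
The inference of a vanishing point from intersecting leg extensions can be sketched with homogeneous coordinates. This is a minimal illustration, not the writer's procedure: the junction positions and leg directions are made-up values, the helper names are hypothetical, and film-plane coordinates are assumed to be centered on the principal point.

```python
import numpy as np

def line_through(point, direction):
    """Homogeneous line through an image point along a 2-D direction."""
    p = np.array([point[0], point[1], 1.0])
    q = np.array([point[0] + direction[0], point[1] + direction[1], 1.0])
    return np.cross(p, q)

def intersect(line1, line2, eps=1e-12):
    """Intersection of two homogeneous lines, or None if they are parallel
    (vanishing point at infinity)."""
    x = np.cross(line1, line2)
    if abs(x[2]) < eps:
        return None
    return x[:2] / x[2]

def edge_direction(vp, c):
    """Unit object-space direction of the edges converging on vanishing
    point vp, for camera constant c (camera axis along z)."""
    d = np.array([vp[0], vp[1], c], dtype=float)
    return d / np.linalg.norm(d)

# Two junctions, each with one leg belonging to the same edge family.
leg_a = line_through((0.0, 0.0), (1.0, -1.0))
leg_b = line_through((1.0, 0.0), (1.0, -2.0))
vp = intersect(leg_a, leg_b)
print(vp)                       # location of the inferred vanishing point
print(edge_direction(vp, 1.0))  # corresponding edge direction
```

With three such vanishing points in hand, their configuration can be checked against the conditions discussed above before the junctions are accepted as OTV candidates.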


Most of the OTVs in a given structure will be aligned with
their edges mutually parallel. In fact it will not be uncommon
for assemblies and collections of structures to be similarly aligned,
such as, for example, in the case of buildings in a city block.
External factors can mediate absolute alignment of OTVs as
well. Of these, the direction of the gravity vector is perhaps the
single most important determinant in the alignment of cultural
objects relative to the Earth. This is of course especially the
case for architectural structures. Throughout such structures
right angles and OTVs abound. Exterior OTVs are seen, for
example, at intersections of walls and roofs of concrete buildings,
at doorways, windows, etc. They are found in structural interiors
at the corners of rooms, doors, windows and the like, as well as
on their contents, such as innumerable small items, machinery,
cabinets, shelves and desks. Most of these OTVs will contain
vertical edges, and often they are mutually azimuthally aligned
as well. Walls are generally vertical, and thus too the edges
where they intersect. Roofs are most generally composed of
planar sections, the surfaces largely being either horizontal or
symmetrically configured shedding slopes the perimeters of which
are horizontal.

4. Nadir-Viewing Aerial Stereophotogrammetry

The most common application of stereo processing is in
aerial stereophotogrammetry. The nominal situation involves
acquisition of a pair of photographs taken by a nadir (vertically
downward) directed camera at stations horizontally displaced
from one another by a substantial fraction of the height of the
camera above the terrain. Such wide-angle stereo favors
accuracy, while at the same time complicating the determination
of stereo correspondences within highly convoluted or complex
structures. Because of the importance of nadir viewing aerial
stereophotogrammetry, we now direct our attention to this
special case. We will concern ourselves with structures containing
vertical and horizontal edges, many of which meet to form
vertically aligned OTVs. We shall assume for the purpose of the
presentation that the camera is pointed directly downward. In
practice, the departure from perfect nadir alignment will require
a slight generalization of the implementation. Thus, the camera
axis will be parallel to the gravity vector and to the vertical
edges of structures. This is a degenerate case of the general
oblique configuration illustrated in Figure 2, in that a vanishing
point coincides with the principal point. Horizontal surfaces will
be parallel to the film plane. This makes the configuration an
especially simple one to deal with. The coincidence of the nadir
vanishing point with the principal point drives the remaining pair
of OTV edge vanishing points to infinite distance in mutually
orthogonal directions relative to the principal point on the film
plane.

It will help in the following discussion for the unfamiliar
reader to be alerted to the utility of epipolar lines. Consider a
pair of stereo cameras located in known relative positions and
orientations. Any plane containing the perspective centers of
both cameras is called an epipolar plane. The intersections of
an epipolar plane with the two film planes are called epipolar
lines. Each object space point has associated with it then an
epipolar plane and, therefore, a pair of epipolar lines. It follows
that corresponding points in a pair of stereo images must lie upon
corresponding epipolar lines. This is clearly a powerful constraint
in searching for corresponding stereo image points.
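
For the nadir configuration the epipolar constraint takes a particularly simple form, which the following sketch illustrates. It assumes two nadir-directed cameras at equal height with the baseline along the object-space X axis, so that epipolar lines are lines of constant y on the film plane. The station positions, camera constant and function name are hypothetical values chosen for illustration.

```python
def project_nadir(point, station, c):
    """Project object point (X, Y, Z) into a nadir camera whose perspective
    center is at station (X0, Y0, H). Returns film-plane (x, y) relative
    to the principal point."""
    X, Y, Z = point
    X0, Y0, H = station
    depth = H - Z                     # distance below the camera along its axis
    return (c * (X - X0) / depth, c * (Y - Y0) / depth)

c = 0.15
left_station = (0.0, 0.0, 1000.0)
right_station = (400.0, 0.0, 1000.0)  # displaced along the flight line
corner = (250.0, 120.0, 30.0)         # an object-space vertex

xl, yl = project_nadir(corner, left_station, c)
xr, yr = project_nadir(corner, right_station, c)
print(yl, yr)     # equal: both images of the vertex lie on one epipolar line
print(xl - xr)    # disparity along the epipolar line, which encodes the range
```

The search for the corresponding junction in the second image is thereby reduced from two dimensions to one.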

We wish now to consider the projective characteristics of
nadir-viewed vertically aligned exterior and interior OTVs. We
will develop a set of visibility rules, or junction signatures, that
will characterize the appearance of the individual OTVs. Given
a junction configuration corresponding to an arbitrarily oriented
OTV in one image, the rules will enable the quantitative
determination of the appearance of the corresponding junction in
the other image, as a function of position along the associated
epipolar line. The development of the rules will be facilitated by
considering the array of RPP wire models illustrated in Figure 3.
The perspective in this figure is vastly exaggerated relative to the
typical high altitude situation. This is evident from the fact that
the ratio of object space height of a vertical edge, such as that
corresponding to (C,C'), to the height of the camera above its
base C' is equal to the image space ratio (C,C')/(C,NP). Thus,
the camera is in this example not much more than twice the
height of the structure. The wire figures will serve as generators
for the junction signatures of solid OTVs. The wire models may
be considered to rest upon the nominal ground plane, in a
common state of rotational alignment about the vertical. The nadir
point, horizontal surface normal point and principal point are

Figure 3. Nadir view of four gravity-aligned wire
model RPPs. The horizon lines have been labeled
with relative compass directions. The epipolar line
through the nadir point corresponds to a projection of
the flight path. The wire models are used to generate
image plane junction signatures for the OTVs.


mutually coincident at NP. The lines shown intersecting at the
nadir point are the horizon lines for the vertical faces of the
model. They are at right angles to one another for this special
case. Each of the four wire models has been placed in a separate
one of the zones created by the horizon planes, after the fashion
of the general oblique case depicted in Figure 2. As per the
discussion of the general case, the potential visibility of the sides,
vertices and edges of a solid RPP figure filling any one of the
wire models is invariant under translation within any given zone.
The flight line, which generates the intercamera baseline, extends
from left to right.
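
The height ratio invoked above to gauge the exaggeration of perspective in Figure 3 can be verified with a short calculation. The sketch assumes a nadir camera directly above the nadir point and measures image distances radially from NP; the numerical values and function names are hypothetical.

```python
def image_radius(ground_distance, depth, c):
    """Radial image distance from NP of a point at the given horizontal
    distance from the nadir and the given depth below the camera."""
    return c * ground_distance / depth

def height_ratio_from_image(top_radius, base_radius):
    """Image-space ratio (C,C')/(C,NP) for a vertical edge with upper
    vertex C and lower vertex C'."""
    return (top_radius - base_radius) / top_radius

c = 1.0
camera_height = 100.0   # height of the camera above the base C'
edge_height = 45.0      # object-space length of the vertical edge
ground_distance = 60.0  # horizontal distance of the edge from the nadir

top = image_radius(ground_distance, camera_height - edge_height, c)
base = image_radius(ground_distance, camera_height, c)
print(height_ratio_from_image(top, base))   # image-space ratio
print(edge_height / camera_height)          # object-space ratio
```

The ratio is independent of the horizontal distance from the nadir and of the camera constant, which is why it can be read directly off the figure.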

It will be recalled that the sector demarcation lines in Figure
2 were horizon lines. Horizon lines on the film plane correspond
to object-space horizon planes and, therefore, to no particular
direction at all in object space. At the risk of some confusion
(regarding implications for the general oblique case), it will be a
convenience in the nadir case to consider a dual interpretation for
the horizon lines on the film plane. We will consider them also
to represent projections of object-space lines that pass through a
projection of the nadir point, perpendicular to the nadir
direction toward the horizontal vanishing points. The rays formed
by breaking the lines at the nadir point correspond to the
outward surface normals for the vertical faces of the corresponding
solid. We label the rays with relative (not necessarily geographic)
compass directions N, W, S and E. The E direction is defined
to be that of the first leg encountered rotating
counterclockwise, looking downward, from the rightward direction along
the epipolar line, labeled in this illustration as the flight path
projected through the nadir point. It is tempting to call the
sectors into which the horizon lines partition the film plane in
the nadir case “quadrants”. However, when the camera is not
directed precisely downward, the horizon lines will not intersect
at right angles on the film plane, as we have observed in Figure 2.
Thus, we retain the sector terminology. We number the sectors
counterclockwise in the order 1, 2, 3 and 4, with sector 1
corresponding to the region between the E and N legs of the compass
rose. Figure 3 indicates a labeling for the wire model vertices.
The four upper vertices are labeled A, B, C and D; the four lower
ones A', B', C' and D'. Vertex A' is directly beneath A, and so
forth. The labeling assignments must be made in the relative
positions indicated with respect to the vanishing point compass
rose.

Figure 4. The set of sector 1 junction signatures
for the wire model, and for interior and exterior OTVs.


We now introduce a junction-signature assignment for the
OTV projections on the film plane. The scheme we employ
assigns to each junction the directions of the vanishing points
associated with each of the visible outward-bound edges of the
object space OTV. For the case of the wire model, the assignment
is invariant under translation of the model relative to the camera, or
vice versa, throughout the entire region of object space in front
of the camera. Consider for example wire model vertex A. It
has associated with it edges directed W, S and D. Each of the
edges is visible. Thus, we say that the junction corresponding to
the wire model vertex A has signature WSD. In similar fashion
we assign to wire model vertex D', for example, the junction
signature NWU, and so forth for the remainder of the junctions.
The signatures for the eight wire model junctions are indicated
in the upper left of Figure 4. The eight junction signatures are
seen to be unique.
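
The wire model assignment can be generated mechanically, as the following sketch suggests. It rests on assumptions that are not stated explicitly in the text: the corner positions (A at the north-east corner, then B, C and D proceeding counterclockwise) are inferred from the signatures quoted above for A and D', and the ordering of letters within a signature (horizontal legs in counterclockwise order, then the vertical leg) is a convention adopted here for illustration. All names are hypothetical.

```python
# Sides of the box on which each corner sits (inferred placement).
CORNER_SIDES = {"A": ("N", "E"), "B": ("N", "W"), "C": ("S", "W"), "D": ("S", "E")}
OPPOSITE = {"N": "S", "S": "N", "E": "W", "W": "E"}
CCW = "NWSE"  # compass directions in counterclockwise order, seen from above

def wire_signature(label):
    """Junction signature of a wire model vertex such as "A" or "D'"."""
    lower = label.endswith("'")
    sides = CORNER_SIDES[label.rstrip("'")]
    # Horizontal edges run away from the sides on which the corner sits.
    first, second = (OPPOSITE[s] for s in sides)
    if CCW[(CCW.index(first) + 1) % 4] != second:
        first, second = second, first
    # Upper vertices have a downward vertical edge, lower ones an upward edge.
    return first + second + ("U" if lower else "D")

labels = ["A", "B", "C", "D", "A'", "B'", "C'", "D'"]
signatures = {v: wire_signature(v) for v in labels}
print(signatures)
```

Under these assumptions the eight generated signatures are distinct, in agreement with the observation made for Figure 4.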


The signatures for a RPP solid may be generated from the
wire models of Figure 3 by imagining the wire model edges to
correspond to the edges of a RPP solid. The signatures for the
junctions associated with each of the solid vertices are obtained
by masking the signatures for the wire model vertices
according to the visibility of the associated edges, that is according
to whether or not they are self obscured. A solid junction
signature will consist of the list of the names of the object space
directions associated with the outward pointing visible legs in
the projection. The signature associated with a solid vertex will
be invariant under vertex translation throughout any given one
of the four object space zones into which space is partitioned
by the two vertical horizon planes. Abrupt changes of
signature can, however, occur upon translation of a vertex across a
horizon plane. Thus, it is necessary to indicate a separate set
of signatures for each of the four zones within which the OTV
can reside, or correspondingly for each of the four sectors into
which its projected junction can fall. Consider, for example, the
wire model image in sector 1 of Figure 3 to correspond to a solid.
Solid vertex A is assigned junction signature WS, compared to
the corresponding wire model junction signature WSD, since for
the solid the edge directed toward the nadir point is invisible.
Vertex B is assigned junction signature SED, the same as that
for the wire model, since all edges of this vertex are visible. The
junction signatures associated with RPPs situated in other zones
are developed in like fashion to that indicated here for sector
1. The set of sector-1 junction signatures associated with the
exterior OTVs of a zone-1 solid is indicated in the upper right
of Figure 4. Each signature is seen to be unique. Appendix
A lists, in condensed form, the complete set of junction
signatures associated with solid exterior OTVs occurring in each of
the four possible zones. The junction assignments are organized
in accordance with the sectors within which they fall. Since the
signature of a solid exterior vertex of particular orientation can
only change when the vertex is translated across a horizon
plane, the vertex label is completely and uniquely characterized
by a) the sector within which it appears, and b) its junction
signature.
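
The masking step for exterior OTVs can be sketched as follows. This is an illustration under explicit assumptions rather than the tabulation of Appendix A: an edge of the convex solid is taken to be visible when at least one of its two adjacent faces is visible, the corner placement and letter ordering are the same hypothetical conventions as for the wire model, and the set of visible faces for a zone-1 solid is taken to be U, S and W, which is consistent with the two examples worked in the text.

```python
CORNER_SIDES = {"A": ("N", "E"), "B": ("N", "W"), "C": ("S", "W"), "D": ("S", "E")}
OPPOSITE = {"N": "S", "S": "N", "E": "W", "W": "E"}
CCW = "NWSE"  # compass directions in counterclockwise order, seen from above

def ordered(directions):
    """Horizontal legs in counterclockwise order, then the vertical leg."""
    horiz = [d for d in directions if d in CCW]
    vert = [d for d in directions if d in "UD"]
    if len(horiz) == 2 and CCW[(CCW.index(horiz[0]) + 1) % 4] != horiz[1]:
        horiz.reverse()
    return "".join(horiz + vert)

def solid_signature(label, visible_faces):
    """Exterior OTV signature: wire model legs masked by self-obscuration."""
    lower = label.endswith("'")
    s1, s2 = CORNER_SIDES[label.rstrip("'")]
    level = "D" if lower else "U"   # horizontal face on which the vertex lies
    # Each leg direction mapped to the two faces that meet along that edge.
    legs = {
        OPPOSITE[s1]: {s2, level},
        OPPOSITE[s2]: {s1, level},
        ("U" if lower else "D"): {s1, s2},
    }
    return ordered([d for d, faces in legs.items() if faces & visible_faces])

zone1_visible = {"U", "S", "W"}     # assumed visible faces of a zone-1 solid
for vertex in ["A", "B", "C", "D", "A'", "B'", "C'", "D'"]:
    print(vertex, solid_signature(vertex, zone1_visible) or "(hidden)")
```

An empty signature indicates a vertex all of whose edges are self-obscured, so that no junction appears for it in the image.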

Finally, we wish to consider reentrant, or interior OTVs,
such as would be encountered, for example, in images of roof
depressions, windows, doors and the like. We will refer to these
as interior OTVs. Their signatures can be generated by
considering the images of rectangular holes in planar solid surfaces.
It is again convenient to refer to the wire models of Figure
3. Consider now the wire model in zone 1 to be the
junction-signature generator for the vertices associated with a rectangular
hole in the U, or top, side of a horizontal surface of a solid.
Reference to Figure 3 indicates that an interior OTV at vertex
location A will have junction signature WSD, identical to that
for the wire model. We assign this vertex the label UA. The
vertex at B, labeled UB, will have junction signature SE, compared
to SED for the wire model. Vertex A' will have signature WSU,
if the hole is shallow enough for the vertex to be visible. A WSU
junction signature can be generated by holes in any of the three
sides. Furthermore, it is impossible from a consideration of the
directions of the legs alone to determine whether the hole is blind
or is clear through to the opposing side. This vertex type cannot
be uniquely associated with a hole on any particular face. We
will label it X' when it is later recognized in an image. Because
its visibility is indeterminate, the signature will be enclosed
in parentheses. This is the only solid vertex label, among both
interior and exterior OTVs, that has multiple junction entries
under the present scheme. It speaks more to the nomenclature,
however, than to the geometry, since all of these
indistinguishable vertices have identically directed edges in object space. The
signatures for the junctions situated in other sectors are
developed in like fashion to that indicated here for sector 1. The set
of junction signatures for sector 1 interior OTVs is presented in
the lower portion of Figure 4. The complete list of interior OTV
junction signatures for all sectors is given, in condensed form, in
Appendix A.

Though both the sector and visible-edge signature
assignments may differ in stereo image junction pairs corresponding
to the same object space OTV, the labeling assignment, which
is determined by the joint values of sector number and
signature assignment, will be identical. This will be true for the
general oblique case as well as for the nadir case that we have
concentrated on. This is the basis for the utility of the scheme
for facilitating resolution of stereo correspondences. Application
of the labeling scheme to corresponding stereo junctions yields
identical label assignments regardless of the size of the
convergence angle. Additionally, the absolute orientation of the OTV
is established by the inference of the directions of the vanishing
points. Though the labeling assignment will be identical for both
members of any given stereo pair, regardless of relative camera
orientation and position, this does not imply that the assignment
is the same for all pairs of cameras that might record the scene.
The assignment is dependent upon the object space direction
of the intercamera baseline which, in turn, defines the family of
epipolar planes. The direction in space of the baseline, and the
stereo sense in which it is viewed, influences assignment of zones
and sectors, and the labels of vanishing points and junctions for
the members of a stereo pair of images. The labeling assignment
is stereo-pair specific. The inherent orientation information is
relative to the stereo camera system.

The search for gravity aligned OTVs in nadir imagery is
more straightforward than that described for the general
oblique case. It is a feature of planar perspective projection
that the projection of a planar object space figure aligned
parallel to the plane of projection is but a rescaled version
of an orthographic projection of the given figure. Thus, in
nadir photography, contour lines and horizontal surfaces are
rescaled orthographic projections. Horizontal surfaces of differing
heights will experience relative displacement on the image plane.
Horizontal angles are invariant under the projection. Nadir
aligned OTVs will have two horizontal edges and one vertical edge.
Either two or three of the legs will be visible on projection. If
three legs are visible, then the horizontal pair will project as a
right angle and the vertical leg will project as an edge aligned

Figure 5. Stereo image pair of “architectural”
complex, acquired by nadir viewing cameras. The
left image is in sector 1 and the right image is in
sector 2. The OTV labels have been assigned by an
application of the junction signature tables developed.

with the nadir point. If only two edges are visible, then they
will project either as a right angle or as one edge aligned with
the nadir point and a second edge generally in some other
direction, though parallel with that of other like-aligned OTVs, if
equivalent legs of the latter are visible. The three-legged OTV
junction appears to be relatively the most unambiguous indicator
of an OTV. Local collections of image lines associated with a
common triple of appropriately arrayed vanishing points would
appear to suggest presence of OTVs. Degeneracies arise along
horizon lines, with a double degeneracy occurring at the nadir
point. T-junctions, oblique and orthogonal, can suggest painted
surfaces, obscuration of features, and the like, as well as a
degenerate OTV. A myriad of factors can complicate the
detection of OTVs in real imagery, such as signal to noise ratio,
resolution, shadow, painted markings, alignment degeneracy, and the
like.
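
The two projective cues relied upon here, the preserved right angle of the horizontal legs and the radial alignment of the vertical leg, can be demonstrated with a short sketch. It assumes an ideal nadir camera at height H above the origin, with the nadir point at the film-plane origin; the corner location, azimuth and function names are hypothetical.

```python
import math

def project_nadir(point, H, c):
    """Central projection of (X, Y, Z) into a nadir camera at height H above
    the origin; returns film-plane (x, y) with the nadir point at (0, 0)."""
    X, Y, Z = point
    s = c / (H - Z)
    return (s * X, s * Y)

def angle_between(u, v):
    """Unsigned angle in degrees between two 2-D vectors."""
    dot = u[0] * v[0] + u[1] * v[1]
    return math.degrees(math.acos(dot / (math.hypot(*u) * math.hypot(*v))))

H, c = 100.0, 1.0
t = math.radians(25.0)              # arbitrary azimuth of the OTV
corner = (30.0, 40.0, 20.0)         # a roof corner
end1 = (corner[0] + 10 * math.cos(t), corner[1] + 10 * math.sin(t), 20.0)
end2 = (corner[0] - 10 * math.sin(t), corner[1] + 10 * math.cos(t), 20.0)
base = (30.0, 40.0, 0.0)            # foot of the vertical edge

p, p1, p2, pb = (project_nadir(q, H, c) for q in (corner, end1, end2, base))
leg1 = (p1[0] - p[0], p1[1] - p[1])
leg2 = (p2[0] - p[0], p2[1] - p[1])
horizontal_angle = angle_between(leg1, leg2)
# Zero cross product: both ends of the vertical edge lie on one line
# through the nadir point.
radial_cross = p[0] * pb[1] - p[1] * pb[0]
print(horizontal_angle, radial_cross)
```

A detector for nadir-aligned OTVs might therefore test candidate junctions for a right angle between two legs and for a third leg directed along the radius from the nadir point, subject to the complicating factors just listed.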

An illustrative application of the OTV typing scheme is
depicted in Figure 5. The figure schematically depicts a pair of
nadir-acquired aerial stereo images of a gravity-aligned
“architectural” structural complex. Epipolar lines run left-right. The
OTVs are all commonly aligned in azimuth in the figure. Our
convention for labeling the compass directions orients E in the
direction to the upper right and N in the direction to the upper
left in the figure. The scheme places all the OTVs of the
left image into sector 1, and all those in the right image into
sector 2. The sector-specific junction signature rules, given in
Appendix A, have been used to associate object space labels
with the vertices. It will be noted that the corresponding vertices
in the two images have been identically labeled.

Appendix A

Condensed Summary of Nadir-Viewed Gravity-Aligned OTVs

The following is a condensed summary of the solid interior
and exterior signatures for all sectors in the case of nadir-viewed
gravity aligned OTVs. The four-fold rotational symmetry
enables the condensed form. The diagonal numbers are sector
numbers. To the right and below each sector number are the
associated vanishing point directions and vertex labels,
respectively.
Abstract

A procedure for calibration of the stereo camera
transform is described which uses a variable projection
minimization algorithm, applied to an error function
whose dependence on the five camera model parameters
is separable into linear and non-linear components.
The result is a non-linear minimization over three
variables rather than five. The procedure has been
implemented in MACLISP, with good preliminary results.

Introduction 

This paper describes a partial solution to the
following problem: Images from two cameras are given,
with matchings between points in each image which
are thought to correspond to the same point in 3-space,
with no further information such as the location of
known object points. The problem is to use the
matchings to find the location and orientation of one camera
with respect to the other. The solution described here
uses geometrical constraints from the matchings to find
a least squares error function of the five parameters
involved. This error function is in effect linear in the
location parameters, so that best least squares values
for those parameters can be obtained for given
orientation parameters. The error of this fit is then
minimized over the orientation. The result is a non-linear
minimization in three variables rather than five, an
improvement over previous methods (cf. [5]). Note that
the images are arbitrary, in the sense that they can
be from two cameras at one time, or from one moving
camera, or from one fixed camera on a moving object.

Derivation  of  the  Procedure 

More precisely, let camera 1 be at the origin, looking
along the z axis, and let camera 2 be at point $c$,
looking in such a direction that a point $y$ is seen by
camera 2 at $R(y - c)$, where $R$ is a rotation (hence
orthonormal) matrix. Now we know that the following
vectors are co-planar: that from camera 1 to $y$, from
camera 1 to camera 2, and from camera 2 to $y$. (See
Figure 1.) These are just $y$, $c$, and $y - c$, respectively.
Therefore, $y \times (y - c)$ is perpendicular to $c$. Thus if we
know $R$, and have vectors $a$ parallel to $y$ and $b$ parallel to
$R(y - c)$, we have $a \times R^T b$ perpendicular to $c$. This is
true for all object points $y$, and the image point data
(together with camera focal lengths) give a set of
vectors $a_i$ and $b_i$ corresponding to some $y_i$. These allow
us to estimate the location $c$ of camera 2 as the vector
minimizing

$$\sum_{i=1}^{m} \bigl( c \cdot (a_i \times R^T b_i) \bigr)^2 \tag{1}$$

for $m$ the number of image point matchings. Any
vector parallel to $c$ will do, so we will fix the third
component of $c$ to 1. As a result we have a linear least
squares problem in $c$, given the rotation matrix $R$ of
the orientation of camera 2. So we can estimate $R$ by
finding the value of it that minimizes the error in the fit.
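The coplanarity constraint behind expression (1) can be checked numerically. The following Python sketch is illustrative only: the helper names, the test rotation, and the sample object points are assumptions made here, not values from the paper. It builds unit vectors a_i and b_i from a known geometry and evaluates the sum of squared triple products for the true camera location and for a perturbed one.

```python
import math

def cross(u, v):
    return (u[1] * v[2] - u[2] * v[1],
            u[2] * v[0] - u[0] * v[2],
            u[0] * v[1] - u[1] * v[0])

def dot(u, v):
    return sum(ui * vi for ui, vi in zip(u, v))

def mat_vec(M, v):
    return tuple(dot(row, v) for row in M)

def transpose(M):
    return [list(col) for col in zip(*M)]

def unit(v):
    n = math.sqrt(dot(v, v))
    return tuple(vi / n for vi in v)

def coplanarity_error(c, R, matches):
    """Sum over matchings of (c . (a_i x R^T b_i))^2, as in expression (1)."""
    Rt = transpose(R)
    return sum(dot(c, cross(a, mat_vec(Rt, b))) ** 2 for a, b in matches)

# Hypothetical configuration: camera 2 rotated 0.2 rad about the y axis
# and located at c (third component fixed to 1, as in the text).
t = 0.2
R = [[math.cos(t), 0.0, math.sin(t)],
     [0.0, 1.0, 0.0],
     [-math.sin(t), 0.0, math.cos(t)]]
c = (0.5, -0.3, 1.0)
points = [(1.0, 2.0, 8.0), (-2.0, 0.5, 6.0), (0.3, -1.0, 10.0)]

# a_i is parallel to y_i; b_i is parallel to R (y_i - c).
matches = [(unit(y), unit(mat_vec(R, tuple(yi - ci for yi, ci in zip(y, c)))))
           for y in points]

error_true = coplanarity_error(c, R, matches)                  # essentially zero
error_wrong = coplanarity_error((0.5, 0.4, 1.0), R, matches)   # clearly nonzero
```

With noise-free matchings the error vanishes (up to rounding) at the true location and grows when c is moved off the plane defined by each matching.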

A well known algorithm (among numerical analysts)
for solving separable least squares problems of this
kind is the variable projection method [1][2]. This
method uses the Levenberg-Marquardt iteration (which
has the Gauss-Newton iteration as a special case) for
the solution of the non-linear part, which involves the
linearization of the problem at each iteration and the
least squares solution of the linearization. Thus each
iteration involves two least squares problems, which
can both be solved by using the numerically stable
technique of Householder transformations to find the
generalized inverses of the matrices involved. The
following development of the method as applied to camera
calibration follows the discussion in [2].

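Let $d_i = a_i \times R(\beta)^T b_i$, where $a_i$ and $b_i$ are unit
vectors pointed from camera 1 and camera 2 to $y_i$, as
easily obtained from the image point matchings. The
dependence of $R$ on the three orientation parameters
$\beta = (\beta_1, \beta_2, \beta_3)$ is explicitly indicated. (The nature of
this dependence is discussed below.) Let $\Phi(\beta)$ be the
$m \times 2$ matrix whose i'th row is the x and y components
of $d_i$, and let $y(\beta)$ be the m-length column vector whose
i'th component is the z component of $d_i$. Then the
problem of minimizing (1) is equivalent to minimizing

$$\| y(\beta) - \Phi(\beta)\, x \|^2 \tag{2}$$

over $x \in R^2$, for given $\beta$. (With the third component of $c$
fixed at 1, $c \cdot d_i$ is the i'th component of $y(\beta)$ plus the
i'th row of $\Phi(\beta)$ times $(c_1, c_2)^T$, so $x$ corresponds to
$-(c_1, c_2)^T$.) This can be done in the following
way: compute the "orthogonal decomposition"
of $\Phi$, i.e., find an $m \times m$ matrix $Q(\beta)$ such that

$$Q(\beta)\, \Phi(\beta) = \begin{pmatrix} U(\beta) \\ 0 \end{pmatrix} \tag{3}$$

where $U$ is a right triangular $2 \times 2$ matrix, and $Q$ is
orthonormal. Partition $Q$ into two sub-matrices $Q_1$
and $Q_2$, with the first 2 rows of $Q$ forming $Q_1$ and
the remaining $m - 2$ rows forming $Q_2$. Since $Q$ is
orthonormal, we know that $\|Qz\| = \|z\|$ for m-vectors
$z$, so

$$\|y - \Phi x\|^2 = \|Q(y - \Phi x)\|^2
= \left\| \begin{pmatrix} Q_1 y \\ Q_2 y \end{pmatrix} - \begin{pmatrix} U x \\ 0 \end{pmatrix} \right\|^2
= \|Q_1 y - U x\|^2 + \|Q_2 y\|^2 .$$

Since $\|Q_2 y\|^2$ doesn't depend on $x$, the optimal $x$ for
fixed $\beta$ is $x = U^{-1}(\beta)\, Q_1(\beta)\, y(\beta)$, and we want to
minimize $\|f(\beta)\|^2$, where $f$ is the $(m - 2)$-vector valued
function $Q_2(\beta)\, y(\beta)$.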
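The linear step for a fixed orientation can be sketched as follows. This is a minimal Python illustration under stated assumptions: it solves the 2 x 2 normal equations directly for brevity, whereas the paper uses an orthogonal (Householder) decomposition, which is numerically more stable; the function name and the sample data are hypothetical. The minimal residual it returns equals the squared norm of the reduced function that the non-linear iteration then minimizes over the orientation.

```python
def solve_linear_part(phi, y):
    """Least-squares x minimizing ||y - phi x||^2 for an m x 2 matrix phi.

    Returns the optimal x and the squared residual at that x.
    """
    s11 = sum(p[0] * p[0] for p in phi)
    s12 = sum(p[0] * p[1] for p in phi)
    s22 = sum(p[1] * p[1] for p in phi)
    t1 = sum(p[0] * yi for p, yi in zip(phi, y))
    t2 = sum(p[1] * yi for p, yi in zip(phi, y))
    det = s11 * s22 - s12 * s12
    if det == 0.0:
        raise ValueError("phi is rank deficient")
    # Closed-form solution of the 2 x 2 normal equations.
    x = ((s22 * t1 - s12 * t2) / det, (s11 * t2 - s12 * t1) / det)
    residual = sum((yi - p[0] * x[0] - p[1] * x[1]) ** 2
                   for p, yi in zip(phi, y))
    return x, residual

# Hypothetical data: y is generated exactly as phi times (2, -1), so the
# fit should reproduce those coefficients with no residual.
phi = [(1.0, 0.0), (0.0, 1.0), (1.0, 1.0), (2.0, -1.0)]
y = [2.0, -1.0, 1.0, 5.0]
x, residual = solve_linear_part(phi, y)
```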

We can minimize the norm of $f$ using the Levenberg-Marquardt
algorithm, which iteratively refines $\beta$ as follows:
for the current value of $\beta$, compute the Jacobian
of $f$ for $\beta$, the $(m - 2) \times 3$ matrix $\{f_{ij}\}$, where

$$f_{ij} = \frac{\partial f_i}{\partial \beta_j},$$

then find the least squares solution of

$$\begin{pmatrix} \{f_{ij}\} \\ \nu I \end{pmatrix} \delta \approx \begin{pmatrix} f \\ 0 \end{pmatrix} \tag{4}$$

where the length of the correction vector $\delta$ is controlled
by the Marquardt parameter $\nu$, whose value may vary
with the iterations. The correction vector $\delta$ is then
subtracted from $\beta$ for a new estimate. When $\nu = 0$,
this is the Gauss-Newton method. Finding the least
squares solution for (4) can be done using the same
orthogonal decomposition method as for (2).

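We need to compute the Jacobian of $f(\beta)$. The
columns of this matrix are

$$\frac{\partial f}{\partial \beta_j} = \frac{\partial Q_2(\beta)}{\partial \beta_j}\, y(\beta) + Q_2(\beta)\, \frac{\partial y(\beta)}{\partial \beta_j}.$$

From (3) we know that $Q_2(\beta)\, \Phi(\beta) = 0$, so that

$$\frac{\partial Q_2(\beta)}{\partial \beta_j}\, \Phi(\beta) = -Q_2(\beta)\, \frac{\partial \Phi(\beta)}{\partial \beta_j},$$

and hence approximately

$$\frac{\partial Q_2(\beta)}{\partial \beta_j} \approx -Q_2(\beta)\, \frac{\partial \Phi(\beta)}{\partial \beta_j}\, \Phi^+(\beta),$$

where $\Phi^+ = U^{-1} Q_1$, the "generalized inverse" of $\Phi$,
as above. This simplification is due to Kaufman [3].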

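It remains to determine the derivatives $\partial \Phi(\beta)/\partial \beta_j$
and $\partial y(\beta)/\partial \beta_j$. Since $\Phi$ and $y$ are obtained from the
vectors $d_i$, we need to find

$$\frac{\partial d_i}{\partial \beta_j} = \frac{\partial \bigl( a_i \times R(\beta)^T b_i \bigr)}{\partial \beta_j}
= a_i \times \left( \frac{\partial R}{\partial \beta_j} \right)^T b_i .$$

There are many ways to define a rotation matrix using
three parameters: we will factor $R$ into a product of
rotations $R_1$, $R_2$, and $R_3$ about the x, y, and z axes,
rotating $\beta_1$, $\beta_2$, and $\beta_3$ radians, respectively. These
matrices have a simple form, for example

$$R_1 = \begin{pmatrix} 1 & 0 & 0 \\ 0 & \cos\beta_1 & -\sin\beta_1 \\ 0 & \sin\beta_1 & \cos\beta_1 \end{pmatrix},$$

and $\partial R/\partial \beta_1 = (\partial R_1/\partial \beta_1)\, R_2 R_3$, and so on.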
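The factored rotation and its derivative with respect to the first parameter can be sketched in Python as below. The sign conventions chosen for the rotations about the y and z axes are assumptions (the paper shows only the x-axis factor), and the function names are hypothetical. The sketch compares the analytic derivative against a central finite difference as a sanity check.

```python
import math

def rot_x(b):
    c, s = math.cos(b), math.sin(b)
    return [[1, 0, 0], [0, c, -s], [0, s, c]]

def rot_y(b):  # sign convention assumed
    c, s = math.cos(b), math.sin(b)
    return [[c, 0, s], [0, 1, 0], [-s, 0, c]]

def rot_z(b):  # sign convention assumed
    c, s = math.cos(b), math.sin(b)
    return [[c, -s, 0], [s, c, 0], [0, 0, 1]]

def d_rot_x(b):
    """Element-wise derivative of rot_x with respect to its angle."""
    c, s = math.cos(b), math.sin(b)
    return [[0, 0, 0], [0, -s, -c], [0, c, -s]]

def matmul(A, B):
    return [[sum(A[i][k] * B[k][j] for k in range(3)) for j in range(3)]
            for i in range(3)]

def rotation(beta):
    """R = R1 R2 R3 for beta = (beta1, beta2, beta3)."""
    return matmul(rot_x(beta[0]), matmul(rot_y(beta[1]), rot_z(beta[2])))

def d_rotation_d_beta1(beta):
    """dR/dbeta1 = (dR1/dbeta1) R2 R3."""
    return matmul(d_rot_x(beta[0]), matmul(rot_y(beta[1]), rot_z(beta[2])))

beta = (0.1, -0.2, 0.3)
h = 1e-6
analytic = d_rotation_d_beta1(beta)
plus = rotation((beta[0] + h, beta[1], beta[2]))
minus = rotation((beta[0] - h, beta[1], beta[2]))
numeric = [[(plus[i][j] - minus[i][j]) / (2 * h) for j in range(3)]
           for i in range(3)]
max_diff = max(abs(analytic[i][j] - numeric[i][j])
               for i in range(3) for j in range(3))
```

The derivatives with respect to the other two parameters follow the same pattern, differentiating the second or third factor instead.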
Conclusions

The method has been implemented in MACLISP
using the techniques for orthogonal decomposition
presented in [4]. Preliminary computational experience
indicates rapid (10-15 iterations) convergence to machine
accuracy when the algorithm converges. Convergence
is not global for this class of algorithms, unfortunately,
and good estimates (within 0.3 radians, roughly) of the
components of $\beta$ are necessary.

Further work might include the use of confidence
weights for the image point matchings, initial rough
estimators for $\beta$, and the use of convergence acceleration
techniques. For completeness, a corresponding
procedure should be included for the case that the cameras
are at the same location: this implies that c = 0, and
minimization of (1) isn't meaningful. Another problem
is that no translations along the z axis are allowed by
this formulation.

Finally, it should be noted that this procedure
does not facilitate the use of prior knowledge of the
translation, so that in certain situations a full
five-variable minimization might be appropriate.
ABSTRACT

In the third phase of its association with
the University of Maryland in the DARPA Image
Understanding Program, Westinghouse is
investigating performance problems with image
segmentation algorithms, application of a
knowledge base to image recognition, and the use
of optical flow techniques. These choices are
based upon an anticipated demand for improved
recognition performance for weapon delivery and
reconnaissance systems. This paper discusses
current efforts in each of these areas, as well
as the preparation of a comprehensive data base
and the conduct of meaningful performance
tests.


1.  Introduction 

The Westinghouse Systems Development Division
is now entering the third phase of its
association with the University of Maryland Computer
Vision Laboratory in the DARPA Image Understanding
Program. The Maryland program director is Prof.
Azriel Rosenfeld. The Army program monitor is
Dr. George Jones of the Night Vision and Electro-
Optical Laboratory, Ft. Belvoir.

Highlights of the first phase of the
Westinghouse program (1976-78) were the
demonstration of a CCD histogrammer-sorter which
incorporated a special-purpose Westinghouse CCD
chip, and the preliminary design of an entire
automatic cueing system, using CCD architecture,
in a 3" x 3" x 6" volume (1). During the second
phase (1978-80) attention was directed toward
the hardware implementation of relaxation
algorithms (2,3,4). A non-linear prefilter
based upon this work has been demonstrated, and
will be incorporated into the AUTO-Q processor
(5), as a replacement for conventional averaging
filters. Additional effort in this phase was
directed toward the use of array processors for
algorithm tests where large data bases were
involved (4). The current phase of the program
began in late 1980. It will expand earlier
efforts in the evaluation and real-time
implementation of image understanding algorithms.

The  specific  areas  for  this  effort  are: 

•  Investigation  of  performance  problems 
with  image  segmentation  algorithms; 


•  Construction  of  a  knowledge  base  to 
improve  object  labeling; 

•  Development  of  optical  flow  techniques. 

These  problem  areas  are  considered  crucial  to 
the  success  of  future  image  recognition  programs 
which  support  reconnaissance  and  weapon  delivery 
operations.  The  reasons  for  this  assessment  will 
be  discussed  below. 

At the current state of the art, a variety
of segmentation algorithms are available which
perform target extraction quite well with
"clean" imagery, but which deteriorate rapidly
with the appearance of noise, clutter, or partial
target obscuration. This is a severe bottleneck
in the image recognition process. Fortunately,
some promising new segmentation algorithms are
beginning to appear. Several candidate
algorithms will be selected from those developed at
the University of Maryland, and other IU
programs. One goal of the current effort is the
development of meaningful statistical tests to
evaluate their performance. With this objective
in mind, a data base has been compiled which
reflects complex reconnaissance and weapon
delivery scenarios.

The belief that a knowledge base may be
useful in image recognition is based upon
the inability of machines to deal with some
complex scenes which humans can interpret by
using memory, context, and reasoning. This idea
has been around for years, but with very little
successful implementation. New optical flow
techniques offer some promise of success in
several areas. They can potentially assist in
the (passive) determination of range to a target,
the detection of target motion, and the
evaluation of sensor line of sight changes. The
following paragraphs describe the work which has
been initiated in each of the above areas, the
selection of a "realistic" data base, and test
plans.

2.  Image  Segmentation 

Image segmentation is a process of
partitioning an image into regions, each having
different properties. The class of images
within this project's scope is FLIR target/
background scenes. Thus, segmentation can be
regarded as a process of separating targets from
background clutter.

The standard method of segmenting an image
is  by  gray  level  thresholding.  Here  the  classes 
correspond  to  gray  level  ranges,  e.g.  "light- 
hot"  and  "dark-cool".  Since  these  ranges  are 
not  known  in  advance,  they  must  be  determined  by 
examining  the  gray  level  histogram  and  looking 
for  peaks  (one  dimensional  clusters) ,  and 
choosing  thresholds  (one  dimensional  decision 
surfaces)  that  separate  the  peaks. 
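A minimal Python sketch of histogram-based threshold selection follows. It assumes a bimodal histogram, and the function names, the sample gray levels, and the valley rule (the least populated gray level between the two tallest local peaks) are illustrative choices made here rather than a method specified in the paper.

```python
def histogram(pixels, levels=256):
    """Count how many pixels fall at each gray level."""
    counts = [0] * levels
    for p in pixels:
        counts[p] += 1
    return counts

def valley_threshold(counts):
    """Pick a threshold between the two tallest histogram peaks."""
    # Local maxima of the histogram are the candidate one-dimensional clusters.
    peaks = [g for g in range(1, len(counts) - 1)
             if counts[g] > 0 and counts[g - 1] <= counts[g] >= counts[g + 1]]
    if len(peaks) < 2:
        raise ValueError("histogram has fewer than two peaks")
    lo, hi = sorted(sorted(peaks, key=lambda g: counts[g], reverse=True)[:2])
    # The threshold is the least populated gray level strictly between them.
    return min(range(lo + 1, hi), key=lambda g: counts[g])

# Hypothetical pixel values: a "dark-cool" cluster near 11 and a
# "light-hot" cluster near 51, plus one stray pixel.
pixels = [10] * 5 + [11] * 9 + [12] * 4 + [30] + [50] * 3 + [51] * 7 + [52] * 2
threshold = valley_threshold(histogram(pixels, levels=64))
labels = ["light-hot" if p > threshold else "dark-cool" for p in pixels]
```

When the two clusters overlap, as the following paragraphs note is common for target and background, no such valley exists and this rule breaks down.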

A number of investigators have suggested
that multidimensional feature space should also
be useful for segmenting complex gray scale
images. A variety of features may be defined
over a neighborhood set about a pixel, e.g.
mean, median, variance, commonality, total
variation (6). This approach could be employed
when a single feature, such as gray level, is
not adequate for segmentation because the given
image contains a number of textured regions
whose gray level ranges overlap.

Initial  work  at  Westinghouse  has  indicated 
that  thresholding  by  cluster  detection  is  not 
adequate  for  separating  targets  from  background. 
Gray  level  target  and  background  clusters  are 
often  not  separable,  i.e.  their  probability 
densities  overlap.  Likewise,  the  response  of 
local  operators  tends  to  be  rather  variable,  not 
yielding  well  defined  clusters.  The  basic 
weakness  of  segmentation  schemes  which  use  only 
local  feature  values  is  that  they  attempt  to 
classify  image  parts  without  regard  to  their 
relative  positions  in  the  image.  It  should  not 
surprise  us  that  any  approach  which  does  not 
take spatial contiguity fully into account fails
much of the time.

The  segmentation  algorithms  investigated  in 
our  study  are  those  that  make  use  not  only  of 
similarity  but  also  of  proximity.  The  candidate 
algorithms  for  our  study  include,  but  are  not 
limited  to,  the  following: 

•  Superslice  (7) 

•  Pyramid  spot  detector  (8) 

•  Pyramid  linking  (9) 

•  Two-label  relaxation  (10,11) 

•  Spoke  detector/segmentor  (12) 

A data base of 50 FLIR images (128 x 128) has
been assembled from Army, Navy, Air Force and
Westinghouse sources. Several images from this
data base are shown in Figure 1. Each of the
candidate algorithms will be tested on the first
ten images. Those that perform well will then
be tested on the remainder of the data base.

This brings up the question of how to
evaluate the performance of an algorithm.
Several approaches are being considered for use
alone or in combination. One approach is to
have a human "hand segment" the images. The
resulting binary image plane then becomes an
estimate of the ground truth. Another approach
is to use synthetically generated images for
which the ground truth is known by construction.
One such image has been included in our data
base. A third approach is to feed the output of
each of the segmentors into a common classifier.
The classification accuracy is then assumed to
give an indication of segmentation accuracy.

The task of comparing a number of
segmentation algorithms is by no means clear-cut. The
performance of each algorithm is related to the
type of noise cleaning done before or after each
stage of its operation. Furthermore, each
algorithm has one or more parameters which must
be adjusted. The optimal performance of an
algorithm can only be achieved by fine-tuning
these parameters with respect to the class of
images under consideration. The robustness of
this tuning operation is an extremely important
consideration in a military context, but also
one which is difficult to evaluate.

Most segmentation algorithms contain a
number of processing steps which run in sequence.
It may be possible to separate the stages of
operation of the various algorithms and combine
them in different permutations. Presumably, by
careful analysis, one could take the best parts
of the best of the algorithms and assemble them
into a new algorithm. This idea will be
investigated further as our study progresses.

3.  Construction  of  a  Knowledge  Base 

The  AI  approach  to  scene  analysis  involves 
the construction of a knowledge base and the
exploitation  of  constraints  implied  therein. 
Certain  knowledge  about  the  physical  world  can 
always  be  used.  To  quote  Marr  (13): 

(Cl)  "A  given  point  on  a  physical  surface 
has  a  unique  position  in  space  at  any 
point  in  time. 

(C2)  Matter  is  cohesive;  it  is  separated 
into  objects;  and  the  surfaces  of 
objects  are  generally  smooth  compared 
with  their  distance  from  the  viewer." 

These constraints apply to location on a physical
surface. One approach to using such constraints
is to develop a primitive scene description and
then resort to a convergence of evidence. This
primitive description can take the form of lines,
edges, corners, blobs (obtained from
segmentation), tilt of the ground plane and location of
the horizon.

Higher level information can be incorporated
at a later stage. Higher level knowledge deals
with the particular goals of the analysis and
domain of the data. For example, tanks
sometimes leave warm dust trails or tread tracks.
Trucks and jeeps often travel over roads.
Targets tend to cluster into groups. An object
floating above the ground is more likely to be a
helicopter than a tank.
More information is contained in a sequence
of images than in a single snapshot. The motion
of features derived from the perspective
projection of a scene onto a view window is called
the scene's optical flow. The optical flow
results from a combination of the view window's
movement through the 3-D environment and the
movement of scene components within the
environment. Westinghouse is using data extracted from
optical flows to provide clues to local surface
orientation, relative motion, and depth
relationships. An optical flow image produced
by Westinghouse's hardware system is shown in
Figure 2.

Another source of information takes the form
of a partial world model which can be developed
and stored off-line and retrieved when needed.
One good source is the Defense Mapping Agency's
digital culture and terrain elevation files. If
the location and attitude of an observer are
known, then this data can be used to construct an
initial model for his 3-D environment. Other
data of this type include the planned flight path,
weather conditions, and gathered intelligence.
Figure 2. Optical Flow Scene Produced by Westinghouse AUTO-Q Digital Image Processor (14).
The input to the device was from a TV camera which was rotated between
successive frames. The optical flow image consists of 9 windows, whose corners
are marked. Each window contains a single flow vector.



A GENERAL PURPOSE VLSI CHIP FOR COMPUTER VISION
WITH  FAULT-TOLERANT  HARDWARE 


Michael  R.  Lowry  and  Allan  Miller 

Artificial  Intelligence  Laboratory 
Stanford  University,  Stanford,  Ca.  94305  USA 


Abstract 

This article describes a VLSI NMOS chip suitable
for parallel implementation of computer vision
algorithms. The chip contains a two dimensional array of
processors, each connected to its four neighbors. Each
processor currently has 32 bits of internal storage in
three shift registers, and can do arbitrary boolean
functions as well as serial bit arithmetic.

Our objective is to make a vision processor with
one processor for each pixel. This will require a very
high density VLSI implementation, filling an entire
wafer. We will need fault-tolerant hardware to deal
with the fabrication errors present in such large
circuits. We plan to do this by incorporating redundant
links in the processor interconnections and routing the
links around faulty processors.

Current work focuses on testing a prototype chip
with one processor, redesigning the chip for a more
compact and regular layout, and designing the
redundant link interconnections and hardware support for
picture-size arrays of processors.

Introduction 

Many computer vision algorithms consist of
repeating the same operation over each local region of
an image. This can take excessive amounts of time on a
serial machine, even for non-production research
purposes. Current vision algorithms have reached a limit
due to the processing power of the computers on which
they are running. Further improvements in computer
vision will need a fast, general purpose array processor.

To achieve greater speed, we are developing a
high density two-dimensional array of general purpose
processors. All processors execute the same
instruction sequence, controlled by a common microcode bus,
but individual processors can be selectively enabled by
data-dependent conditions. This idea is not new, but
past work has been limited to small arrays because of
low-density and high-cost hardware [1,2].

In a sense, we are trying to achieve the same
capability as human visual processing of doing all local
operations in parallel [3]. Our approach differs
somewhat from other current work in VLSI computer vision
[4] in that we are developing general purpose hardware,
as opposed to hardware for a particular operation such
as convolution. Since most image processing consists
of a sequence of linear and non-linear operations, even
for a fixed system a special purpose chip is suitable
for only part of a total system. For development and
testing of new algorithms general purpose hardware is
needed.

Processor Design

Figure 1 shows the data path of one processor.
It is important to note that the processor is serial;
in other words, each line in the diagram represents a
single connection. The R register is 16 bits, and the
D1 and D2 registers are each 8 bits. MUX1 and MUX2
select the inputs to the adder. The adder inputs can
be selected from R shift output, D1 shift output, D2
shift output, zero, one, and the latched adder output
of any of the cell's four neighbors. The output selector
selects the shift inputs to R, D1, and D2. The output
selector is constructed in such a way that one of the
three registers can be loaded from the adder output,
and the other two registers are loaded from their own
shift outputs. In this way, D1 and D2 can be summed
into R without losing the contents of either D register.

The control path of the processor is quite simple.
There are twelve externally generated microcode bits.
Six of the bits select the adder inputs through MUX1
and MUX2. Two of the bits control the output
selector: either all three registers are loaded from their shift
outputs or one of the three registers is loaded from the
adder output and the other two are loaded from their
shift outputs. Three of the bits independently enable
shifting of R, D1, and D2. The final microcode bit
causes the shift enable to be loaded from the adder
output. When the shift enable is a zero, register shifting is
disabled regardless of the state of the microcode bits.
This effectively disables the processor.

Programming  examples 

Although the processor design is quite simple, it
allows several fairly important vision algorithms to be
done relatively quickly due to its parallel
implementation of local operations.

To convolve an image with a fixed mask, we store
the greyscale value of each pixel in the D1 register of
one processor and use the method of adding multiples
of a spatially shifted image to accumulate the result in
the R registers of the processors. The shifted image is
put into the D2 register by selecting the proper
neighbor as one adder input and zero as another adder
input, then doing eight shifts while saving the result in
D2. This shift operation can be repeated any number
of times in any direction to produce a properly shifted
image in D2. One convolution step is then completed
by adding the appropriate multiple of D2 to R. For
example, if 5 times the value of D2 is to be added to R,
first D2 is added to R, then R is arithmetically shifted
right two bits, then D2 is added to R again. This
convolution step is then repeated for the entire fixed mask.

Using the shift enable, it is also relatively simple
to multiply D1 by D2 and save the result in R. R
is cleared, then D1 is repeatedly added to R,
arithmetically shifting R right one bit between additions.
However, just before each add operation, the next bit
in D2 is loaded into the shift enable. This has the effect
of only adding shifted versions of D1 to R where the
corresponding bit in D2 is a one. The result is the
desired multiplication, yielding the cross correlation of
an image stored in D1 with the image stored in D2.
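The conditional shift-and-add idea can be sketched in Python as below. This is a conventional software rendering, not the chip's microcode: the variable `enable` stands in for the shift enable, `r` for a double-width R register, and the way the add is gated is an assumption made here for illustration.

```python
def serial_multiply(d1, d2, bits=8):
    """Multiply two unsigned `bits`-wide values by conditional shift-and-add.

    Each pass adds d1 into the upper half of r only when the current bit
    of d2 is one, then shifts r right one bit.
    """
    r = 0
    for i in range(bits):
        enable = (d2 >> i) & 1      # next bit of D2, least significant first
        if enable:
            r += d1 << bits         # conditional add into the upper half of R
        r >>= 1                     # shift R right between additions
    return r

product = serial_multiply(13, 11)
```

After `bits` passes, the contribution added for bit i of D2 has been shifted right exactly enough times to carry weight 2^i, so `r` holds the full product.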

Thresholding a picture is quite easy, since it can
be done by subtracting the threshold from the data
using two's complement arithmetic. The final carry
can be loaded into the shift enable and any
threshold-dependent operation can then be done.
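A minimal Python sketch of thresholding by two's complement subtraction follows. It assumes unsigned 8-bit data and forms the subtraction as data plus the bitwise complement of the threshold plus one; the function name and the polarity of the resulting flag are assumptions for illustration.

```python
def threshold_carry(data, threshold, bits=8):
    """Carry out of data - threshold, computed as data + ~threshold + 1.

    For unsigned operands the carry is 1 exactly when data >= threshold.
    """
    mask = (1 << bits) - 1
    total = (data & mask) + (~threshold & mask) + 1
    return (total >> bits) & 1

# Hypothetical row of pixel values thresholded at 100.
row = [12, 100, 99, 250, 0]
enabled = [threshold_carry(p, 100) for p in row]
```

The carry bit plays the role of the value loaded into the shift enable, so only processors whose data reaches the threshold take part in later operations.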

To find directional zero crossings, each processor
retrieves the sign bit of the data in its neighbor in the
desired direction. Processors with positive data are
then enabled by loading the complement of the sign bit
into the shift enable (complementing a bit is done by
adding one to it after clearing the carry). Each enabled
processor then loads its neighbor's sign bit into the shift
enable. The result is that only processors that were
originally next to zero crossings with negative slopes
are enabled (the positive-slope zero crossings can be
found by operating in the opposite direction).
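The two enabling steps can be sketched for a single image row in Python. The function name, the choice of the right-hand neighbor as the "desired direction", and the border handling are assumptions made here; zero is treated as positive because its two's complement sign bit is 0.

```python
def negative_slope_crossings(row):
    """Flag positions with non-negative data whose right neighbor is negative."""
    flags = []
    for i, value in enumerate(row):
        # Border assumption: the last element sees its own value as neighbor.
        neighbor = row[i + 1] if i + 1 < len(row) else value
        own_sign = 1 if value < 0 else 0
        neighbor_sign = 1 if neighbor < 0 else 0
        enable = 1 - own_sign        # step 1: enable processors with positive data
        if enable:
            enable = neighbor_sign   # step 2: enabled processors load neighbor's sign
        flags.append(bool(enable))
    return flags

flags = negative_slope_crossings([3, 1, -2, -4, 2, 5, -1])
```

Running the same steps with the left-hand neighbor would flag the positive-slope crossings instead.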

By using various parts of the D registers for
mantissas and exponents, it is possible to simulate limited
floating-point operations, although it is clear to us that
eight bits of data severely limits both the range and
precision of these numbers.

Although we have not yet investigated further
algorithms to the same detail, it seems clear that
relaxation methods such as finite element analysis for
solving 2-dimensional differential equations are particularly
well-suited for local processing. In addition, we are
currently investigating a halftoning algorithm that spreads
errors due to thresholding in a uniform manner rather
than a biased fashion as done in current algorithms [5].
We feel that these diverse applications of our processor
are a good indication of its generality.

Technological  considerations 

Figure 2 shows the processor layout. Our processor
is currently approximately square, measuring 500 λ on
a side. Since our test chip was fabricated with a 2.5
micron λ, our test processor is 1.25 mm on a side.
To make a 512 by 512 array of processors by simply
replicating this processor would require a wafer 64 cm
on a side. A reasonable design will have to reduce the
linear dimension of the processor by 10, or the area by
100.


Current fabrication techniques can achieve a 1
micron λ, giving us a factor of 2.5 linearly, or 6.25
in area. The microcode lines take up approximately
one third of the area of our processor, so sharing them
between processors gives another factor of 1.2 in area.
The memory in the cell can probably be reduced to
at least three fourths of its current size (using standard
memory cell designs). Since the memory currently
takes about one third of the cell, this gives another
factor of 1.1 in area. Our original design left much
blank area in the interconnection of the processor elements,
and the processor could probably be reduced to
two thirds of its current size. Since the processor takes
up one third of the cell area, a cleaner, more regular
processor design would give yet another factor of about
1.13 in area. Taken together, these factors mean that
a 512 by 512 processor array is within about a factor of
11 in area or 3.3 in linear dimension of being feasible.
We expect to be able to make a 128 by 128 array in the
near future, and move toward larger arrays as fabrication
technology improves.
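
The area bookkeeping in this paragraph can be checked with a short calculation. The sketch below is only an illustration: the helper name `shrink_factor` is hypothetical, and the 1.2 factor for shared microcode lines is taken from the text rather than derived.

```python
import math

def shrink_factor(share_of_cell: float, new_relative_size: float) -> float:
    """Area gain when a part occupying `share_of_cell` of the cell is
    scaled to `new_relative_size` of its current size."""
    new_cell_area = (1.0 - share_of_cell) + share_of_cell * new_relative_size
    return 1.0 / new_cell_area

# Factors quoted in the text.
lithography = 2.5 ** 2                    # 2.5 micron -> 1 micron lambda
shared_microcode = 1.2                    # quoted directly, not derived here
memory = shrink_factor(1 / 3, 3 / 4)      # about 1.09, quoted as 1.1
processor = shrink_factor(1 / 3, 2 / 3)   # 1.125, quoted as about 1.13

REQUIRED_AREA_REDUCTION = 100.0           # a factor of 10 in linear dimension

# Use the rounded factors from the text for the overall estimate.
achieved = lithography * shared_microcode * 1.1 * 1.13
remaining_area = REQUIRED_AREA_REDUCTION / achieved
remaining_linear = math.sqrt(remaining_area)

print(f"achieved {achieved:.2f}; remaining {remaining_area:.1f} in area, "
      f"{remaining_linear:.1f} in linear dimension")
```

With the rounded factors the remaining gap comes out near 11 in area and 3.3 in linear dimension, in line with the figures quoted.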

The biggest problem with making a wafer-sized
chip as we plan to do is overcoming the problem of low
yield. The largest commercially available chip today
is the Motorola M68000, which is about 6 mm by 7
mm. Since we plan on making a chip with 100 times
the area, and the standard yield model decreases
exponentially with area, we must use a processor
interconnect scheme that can tolerate errors in chip
fabrication. (Specifically, the model incorporates an "error
density" that is constant over the wafer. If p is the
probability of finding no errors in one unit area, then
the probability of finding no errors in an area of size A
is $p^A$, so if the area increases by 100 the new yield is
simply the old yield raised to the 100th power. If the old
yield was 10% the new yield will be $10^{-100}$.) We plan
to deal with fabrication errors by incorporating extra
interprocessor connections in the chip. A test run on
the chip will pinpoint bad processors, then in actual
use some of the extra interconnections will be removed,
leaving a network of good processors.
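
As a rough illustration of the constant-error-density yield model described above, the following sketch scales a known yield to a larger area. The function name `scaled_yield` is an assumption made for this example.

```python
def scaled_yield(old_yield: float, area_ratio: float) -> float:
    """Yield after the chip area grows by `area_ratio`, assuming a
    constant error density, so that yield = p ** area."""
    return old_yield ** area_ratio

# A chip with 10% yield, scaled to 100 times the area.
print(scaled_yield(0.10, 100))

# The same chip at only twice the area.
print(scaled_yield(0.10, 2))
```

Under this model a monolithic wafer-scale circuit has essentially no chance of being defect-free, which is the motivation for the redundant interconnections.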

Redundant  element  methods  were  used  in  early 
4K  RAMs,  where  each  half  of  the  circuit  would  be 
tested  independently  and  chips  with  one  bad  memory 
array  would  be  sold  as  2K  RAMs.  More  sophisticated 
methods  are  being  used  today  in  larger  memory  chips 
to  delete  single  rows  and  columns  of  bit-storage  arrays 
[6].

We  foresee  three  major  issues  in  making  a  working 
fault-tolerant design.

We will need to know more about the statistics
of chip errors in order to have a reasonable model on
which to base yield calculations and interconnection
schemes. Although some data is available, it is mostly
statistical in nature and is based on tests of memory
arrays. Industrial integrated circuit manufacturers are
reluctant to reveal yield statistics of their fabrication
facilities, since their profits are dependent on these
figures. We plan to design a circuit that can address
areas on its surface and check for various kinds of
fabrication defects. In this way, we will have a better
model of fabrication errors.

Figure 1. Processor data path.

Figure 2. Actual processor layout.

It will be necessary to have a redundant interprocessor
connection scheme. We are currently considering
either having each processor connect to all
eight of its neighbors or having the processors arranged
in a triangular array with each processor connected to
six of its neighbors. By studying the fault tolerance
of each of these topologies and using the yield data
from the tests described above we will design processor
interconnections to give us reasonable yields on the
wafer-sized circuits. Testing the processors also affects
the interconnection scheme since it must be done before
any interconnections are changed, but the testing itself
will rely on the interconnections to pass data on and
off the wafer. We are also considering having buses
along rows and columns of the wafer through which
processors can communicate with each other and the
outside world. These will simplify addressing a single
processor for testing purposes, and will also allow rapid
communication between distant processors.

Once we have decided on the topology of the
interconnections, we will need to implement these
interconnections. We are currently considering three different
types of interconnections. "Soft" interconnections can
be made by adding one more select input to MUX1
and MUX2 and controlling these inputs from a two-bit
register. By making certain adder input selections not
depend on the register, these selection registers can be
set to a known state after power is applied to the wafer.
Testing can be done on all processors and the results
of the testing can then be used to load the selection
registers. "Firm" interconnections can be made using
the same technology used in EPROMs, where
interconnections can be enabled by storing a charge in the oxide
layer over the base of a pass transistor. This charge can
be removed by exposing the wafer surface to ultraviolet
light, allowing occasional reconfigurations of the chip.
"Hard" interconnections can be made using thin fusible
metal links (the same technology used in PROMs). We
are also investigating laser technologies for making and
breaking links on wafer surfaces as a final processing
step [7].


Performance Estimate

The performance improvements our system will
provide are impressive. A Digital Equipment KL-10,
which is a 6000-chip ECL design, can do a 36-bit
addition in 520 ns if both operands are in the cache [8].
If we define an "operation" to be an 8-bit addition,
the KL-10 can do 8.7 million operations per second.
Our preliminary wafer will have 16,384 processors on
it, so each processor will need to do 530 operations per
second to match the KL-10. This represents a clock
rate of 4.2 kHz, since each 8-bit addition requires eight
clock cycles. Although timing tests are still in a very
preliminary stage, we expect to be able to clock our
device internally at close to 10 MHz, resulting in a
performance improvement of at least 2500. The actual
figure will probably be much larger since our system
will avoid the overhead of computing addresses.

Conclusion

By making an image-sized two-dimensional array
processor on a single silicon wafer, we are building a
research tool for developing more powerful vision
algorithms than current computing facilities allow. The
system is particularly well-suited for algorithms
involving local operations, and can be used to implement
a large variety of algorithms using a general
purpose processing element. We are using a
fault-tolerant processor interconnection scheme to achieve
high enough yield for this full-wafer circuit.
ABSTRACT

This  paper  describes  recent  work  undertaken 
at  Hughes  Research  Laboratories,  Malibu,  Cali¬ 
fornia,  in  support  of  the  DARPA  Image  Under¬ 
standing  (IU)  program.  The  principal  goal  of 
the  work  is  to  investigate  the  application  of 
VLSI  technologies  to  IU  systems  and  identify 
processor  candidates  well  suited  to  VLSI  imple¬ 
mentation.  One  candidate  that  is  very  well 
suited  to  the  VLSI  technology  is  a  programmable 
local-area  processor  with  residue  arithmetic 
based  computations.  The  design  and  development 
of this processor, which operates on a 5x5 kernel,
are  described.  Of  significant  interest  is  an 
LSI  custom  circuit  that  we  are  developing  and 
which  will  perform  the  bulk  of  the  residue  computa¬ 
tions.  In  addition,  an  interface  that  will 
permit  this  processor  to  be  controlled  by  a 
general-purpose  host  computer  (e.g.,  PDP  11/34) 
is  described. 


1. INTRODUCTION

Our  previous  work  (1,2)  in  developing  image 
understanding  architectures  has  concentrated  on 
the  analysis  of  the  processing  functions  required 
for  special-purpose  LSI  primitives.  We  have 
developed  about  16  fixed  and  programmable  primi¬ 
tives  for  real-time  operation. 

The  work  described  here  represents  a  signifi¬ 
cant  shift  in  emphasis  and  an  increase  in  capa¬ 
bility.  First,  we  have  undertaken  a  detailed 
design and analysis of a number of complex processing
operations, including line-finding (3) and
texture analysis (4). This work has been carried
out  specifically  with  LSI  and  VLSI  implementation 
in  mind.  Hence  issues  such  as  chip  and  function 
partitioning,  data  flow,  local  storage,  and  word- 
length  are  specifically  emphasized.  The  results 
of  the  systems  analysis  and  design  for  these 
operations  are  included  in  Section  2.  From  this 
work  we  have  been  able  to  configure  a  fully  inte¬ 
grated  real-time  processor  for  each. 

Of equal importance, and perhaps greater
impact to military systems and robotics, we have
configured, designed, and started to fabricate a
VLSI processor that can form the basis of a fully
programmable image understanding system compatible
with commercially available host machines.

The  architecture  itself  uses  residue  arithmetic  (5) 
to  provide  a  highly  regular  and  extendable  struc¬ 
ture.  These  issues  are  of  great  importance  in  the 
emerging  VLSI  era  where  design  time  and  the  ability 
to  amortize  the  fabrication  cost  of  many  processors 
are  essential  elements.  The  VLSI  processor  now 
under  development  is  configured  on  a  single  board 
with multiple copies of a single custom-built nMOS
chip. Our estimates indicate that the processor
will perform between 50% and 75% of the operations
for line finding and texture classification.

The modular nature of the machine can provide
essentially variable precision as discussed in Section 3.
The single custom-built chip has a complexity
equivalent to approximately 6,500 transistors.
However, with a decrease in design rules
from the present 5 µm to submicron we can anticipate
building a single chip with some 80,000 transistors
and design the full system around four
identical chips.

A  significant  advantage  of  our  approach  is  the 
compatibility  with  general  purpose  host  machines, 
such  as  the  DEC  series,  which  are  widely  used  in 
image  analysis  and  understanding.  We  have  there¬ 
fore  spent  considerable  effort  in  developing  a 
UNIBUS  interface  so  that  the  machine  can  be 
accessed  through  the  host  software.  With  the 
addition  of  the  local  area  logic  processor,  to  be 
developed in the next phase, we expect to demonstrate
a fully programmable real-time processor.

2.  SYSTEMS  ANALYSIS  AND  DESIGN 

The effective exploitation of VLSI technology
in image understanding systems requires that the
processors developed be used in as wide a range of
systems as possible. This requires that
a wide variety of systems be analyzed for the purpose
of determining commonality. Our approach has
been to select three representative systems to
analyze: a line finder, a texture analyzer, and
a segmenter (6). Each system was studied and then
a directed graph depicting the data flow was produced (7).
The directed graph had nodes that were
functionally complex, so the next step was to perform
a logic design for the systems to determine the
complexity of the nodes and of the systems. The
logic design was done for the line-finder and the
texture analyzer systems. For details of each
design, see USCIPI Report 990 (8). A brief summary
of the results for each system is presented below.

The line-finder system decomposes into four
major functions: edge detection, edge thinning,
edge linking, and edge tracking.

A design was generated for each of these
functions and Table 1 presents the number of gates
each required. Similarly, the texture analysis
system decomposed into five major functions:
small-window convolution (5x5), small-window
statistical calculation, scaling, large-window
statistical calculation, and linear transformation.
Table 2 presents the gate count for each system.


TABLE 1. GATE COUNT FOR LINE FINDER SYSTEM

EDGE DETECTION      178K
THINNING            180 GATES
EDGE LINKING        500
EDGE TRACING        12 MBIT MEMORY + 5K LOGIC GATES

TABLE 2. GATE COUNT FOR TEXTURE CLASSIFICATION SYSTEM

6 LINE KERNEL GENERATION                15K GATES
5x5 CONVOLUTION                         27K GATES/CHANNEL
5x5 VARIANCE                            10K GATES
NORMALIZATION                           1K/CHANNEL
LARGE WINDOW STATISTICAL CALCULATION    8.25K/CHANNEL
TRANSFORM (M INPUT CHANNELS)            2.1KM/OUTPUT CHANNEL

From the results of the directed graph analysis
and the logic design, it is obvious that the
function common to all three systems and the most
complex, when measured by the number of gates, is
the small window (5x5) convolution. This supports
our decision to build a programmable 5x5 local-area
processor as the basic VLSI module.


3. A RESIDUE-BASED IMAGE PROCESSOR

The work described in Section 2 motivated the
design of a low-level processor that could perform
the computationally intensive low-level operations
for each of the three systems investigated. In
addition to fulfilling the requirements of the three
systems, we also wanted to select an architecture
that could be extended to take advantage of the VLSI
design and processing capabilities that are currently
being developed. The architecture we
selected is based on the technique of residue
arithmetic.

3.1 Processor Description

We implemented our local area processor in
residue arithmetic to take advantage of modularity,
and hence ease of design, within the VLSI chip and
extendability to handle arbitrary dynamic range and
accuracy. The technique relies on the conversion,
prior to computation, of all the data to relatively
prime bases (we chose 31, 29, 23, and 19) and the
subsequent decoding of the processed data back to
binary numbers. If this overhead is accepted then
the arithmetic itself is reduced both in complexity
and in required dynamic range. This enables us to
use look-up tables, which in our case are programmable
RAM, to perform the necessary arithmetic.
Regularity, ease of VLSI design, and function density
on the chip are significant advantages. Thus
this approach is ideal for VLSI implementation.

A block diagram of a general residue processor
is shown in Figure 1. Some of the advantages
(e.g., modularity) and disadvantages (encoding,
etc.) of this technique are clearly visible in this
representation. The encoding and decoding, when
compared to a binary processor, are overhead functions
and can be the major disadvantage of a residue
processor. However, this overhead cost can be
reduced if enough computations can be performed
while in the residue representation, and hence the
encoding and decoding can be amortized over a large
computation base. The clear advantage of this type
of processor is its natural parallelism. Each parallel
computation channel is independent, requiring
no communication with its neighbors until the conversion
from the residue representation to a binary
representation is performed.
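
The independence of the computation channels can be illustrated in a few lines. This is a minimal sketch using the four bases named in the text; the function names (`encode`, `channel_add`, `channel_mul`) are hypothetical and chosen only for this example.

```python
from math import prod

BASES = (31, 29, 23, 19)   # the relatively prime bases chosen in the text
R = prod(BASES)            # dynamic range of the four-base system

def encode(x: int) -> tuple:
    """Binary value -> one residue per computation channel."""
    return tuple(x % b for b in BASES)

def channel_add(u: tuple, v: tuple) -> tuple:
    """Each channel adds modulo its own base, with no carries between channels."""
    return tuple((a + c) % b for a, c, b in zip(u, v, BASES))

def channel_mul(u: tuple, v: tuple) -> tuple:
    """Each channel multiplies modulo its own base."""
    return tuple((a * c) % b for a, c, b in zip(u, v, BASES))

# Compute a*b + c entirely in the residue domain and compare with
# encoding the ordinary integer result (which lies within the range R).
a, b, c = 200, 150, 37
lhs = channel_add(channel_mul(encode(a), encode(b)), encode(c))
rhs = encode(a * b + c)
print(lhs == rhs)
```

The comparison holds as long as the true result stays below R, which is why the choice of bases fixes the usable dynamic range.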

3.1.1  Kernel  Generation  and  Encoding 

Typically, the input to an image processor is a
string of eight-bit data values generated by a
raster scan of the image. Therefore, we must
include in the processor the means for generating
the two-dimensional kernel. This kernel generation
function is most easily accomplished using a series
of shift registers. For a five-line kernel, four
shift registers, each one containing as many elements
as there are pixels in a line, are required to
generate five adjacent lines of video. For our particular
application the shift registers are 8 bits
wide and 512 elements long.
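
A software model of the line-delay arrangement may make the kernel generation easier to follow. The sketch below is an assumption-laden illustration, not the hardware design: the name `five_line_columns` is hypothetical, and the delay lines are modelled as fixed-length queues initialised to zero.

```python
from collections import deque

def five_line_columns(pixels, line_length=512, lines=5):
    """For each incoming raster pixel, yield the column of `lines`
    vertically adjacent pixels: (current, 1 line ago, 2 lines ago, ...)."""
    # One delay line (shift register) per extra line of the kernel.
    delays = [deque([0] * line_length, maxlen=line_length)
              for _ in range(lines - 1)]
    for p in pixels:
        column = [p]
        value = p
        for d in delays:
            delayed = d[0]      # oldest entry: the value one line earlier
            d.append(value)     # shift in the new value, dropping the oldest
            column.append(delayed)
            value = delayed     # feed the next delay line in the chain
        yield tuple(column)

# Small example: a 4-pixel-wide image, five lines numbered 0..19 in raster order.
columns = list(five_line_columns(range(20), line_length=4))
print(columns[-1])
```

Four delay lines suffice for a five-line kernel because the current line needs no storage.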
Figure 1. General Structure For Residue Processor

Before the input data can be processed by a
residue processor it must be converted from a
binary representation to a residue representation.
This conversion requires that we calculate

$$(X \bmod B_1,\; X \bmod B_2,\; \ldots,\; X \bmod B_n),$$

where X is the value of the input data and $B_i$ is
the ith base. For our case, since we operate on a
5x5 kernel, we must perform this calculation on the
input for each of five lines of video and for each
of four processors (equivalent to the four bases).
The simplest way to perform this calculation for a
general set of bases is to use read only memories
(ROMs). By connecting the input data to the
address lines of the ROM and looking at the data
lines of the ROM for the output, a look-up function
is performed. For our particular processor, which
will support an eight-bit input dynamic range and
bases that can be encoded in five bits or less,
the size of an encoding ROM is 256x5 bits. Figure 2
shows the block diagram for the kernel generation
and encoding portion of a four-base five-line
processor. Each of the five ROMs for each base is
programmed identically.
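
The contents of an encoding ROM follow directly from this description. The sketch below generates them for the four bases; the helper name `build_encoding_rom` is hypothetical.

```python
BASES = (31, 29, 23, 19)

def build_encoding_rom(base: int, input_bits: int = 8) -> list:
    """ROM contents: the address is the binary input value, the stored
    word is that value modulo `base` (which fits in five bits)."""
    return [x % base for x in range(1 << input_bits)]

# One ROM image per base; the five ROMs for a given base are identical.
roms = {b: build_encoding_rom(b) for b in BASES}

pixel = 200
print([roms[b][pixel] for b in BASES])
```

Each table has 256 entries of at most five bits, matching the 256x5 size stated in the text.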

An  alternative  way  to  perform  these  two  func¬ 
tions  would  be  to  encode  before  the  kernel  is 
generated.  The  major  drawback  of  this  technique  is 
that  the  memory  requirements  are  much  greater  for 
the  kernel  generation  process.  For  the  system  we 
are  currently  constructing,  each  line  delay  would 
need  to  be  20  bits  wide  as  opposed  to  eight  bits 
wide for the method we chose.

3.1.2  A  Programmable  Residue  Computation 
LSI  Circuit 

The actual computations on the image data will
be performed by a custom LSI circuit which is currently
being processed at the Hughes Carlsbad
Research Center. The circuit will process a
5x1 kernel, computing

$$y = \sum_{i=1}^{5} f_i(x_i),$$

where y is the output value, the $x_i$ are the five
elements in the kernel, and the $f_i$ represent polynomial
functions of a single variable.

Figure 2. Kernel Generator Encoding
for 5x5 Processor

A functional block diagram of the circuit is
shown in Figure 3. The word size for this is five
bits, which limits the prime bases used to a value
of 32 or less. The circuit is designed to accept a
five-bit input word which is clocked into a
five-element shift register. The contents of each
register element are then shifted to the next register.
The five-bit data in each of the shift register
elements is used to address a look-up table,
which is a 32x5 random access memory (RAM). This
look-up operation performs a unary operation such
as a multiplication by a constant or a squaring
operation. The outputs of the five RAMs are then
summed modularly to produce a five-bit output, the
base of the modular addition being programmable by
external control of the circuit. Since the look-up
tables that perform the unary operation are
composed of RAMs, the circuit can be programmed
for many different computations, such as different
weights for a convolution or different powers of
a number for a statistical calculation.
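
A behavioural model of the 5x1 circuit can be sketched as follows. It is an illustration only: the class and helper names are hypothetical, the pipeline latency of the real chip is ignored, and the pairing of the newest sample with the first table is an assumption.

```python
def constant_multiplier_table(weight: int, base: int) -> list:
    """32-entry look-up table for multiplication by a constant, modulo `base`."""
    return [(weight * v) % base for v in range(32)]

def squaring_table(base: int) -> list:
    """32-entry look-up table for squaring, modulo `base`."""
    return [(v * v) % base for v in range(32)]

class Residue5x1:
    """Five-element shift register, five look-up tables, and a modular sum."""

    def __init__(self, base: int, tables: list):
        self.base = base          # programmable base of the modular addition
        self.tables = tables      # five tables of 32 five-bit words
        self.shift = [0] * 5      # shift register contents, newest first

    def clock(self, word: int) -> int:
        # Shift the new five-bit word in, dropping the oldest element.
        self.shift = [word] + self.shift[:-1]
        # Look up each register element and sum the results modulo the base.
        return sum(t[x] for t, x in zip(self.tables, self.shift)) % self.base

weights = (1, 2, 3, 4, 5)
circuit = Residue5x1(31, [constant_multiplier_table(w, 31) for w in weights])
outputs = [circuit.clock(v) for v in (1, 2, 3, 4, 5)]
print(outputs[-1])
```

Reprogramming the tables (for example with `squaring_table`) changes the function computed without changing the data path.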




Figure  3.  A  Functional  Block  Diagram  for 
5x1  Residue  Processor  Circuit 

A detailed schematic of the circuit is shown in
Figure 4. In addition to having five bits of
input data and five bits of output data, an additional
set of data lines is included in this design.
These data lines, which are bi-directional, serve a
multipurpose role for control and testing. When
used as input data lines they can be used to program
the base of the modular addition and to program
any of the five look-up tables. When used as
output data lines they can read the look-up tables
to verify the operation of the circuit.


This circuit is being fabricated using the
nMOS technology and has been designed to accept a
10 MHz data rate. To achieve this data rate, pipeline
techniques were used, and the resulting latency
for this circuit is seven clock cycles. The circuit
will be packaged in a 28-pin dual in-line package.
Figure 5 is a photograph of the layout of the chip,
which will be available for testing in April 1981.

To utilize this circuit (with a 5x1 kernel) in
a 5x5 local area processor, multiple copies of the
circuit need to be used as well as additional logic
to combine the outputs of the individual circuits.
For each base, five of these circuits are used, one
for each line of the kernel. In addition, four
1,024x5-bit ROMs are used to sum the outputs of the
five circuits. ROMs are used instead of adders
because the additions must be done modularly. Figure 6
shows the block diagram of the processor,
including the encoding and computation portions.
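
The ROM-based modular addition can be sketched as below. This is a hedged illustration: the address layout (one operand in the upper five address bits, the other in the lower five) and the chained arrangement of the four ROMs are assumptions, and the function names are hypothetical.

```python
def build_adder_rom(base: int) -> list:
    """1,024 x 5-bit ROM: a 10-bit address holds two 5-bit operands,
    and the stored word is their sum modulo `base`."""
    return [((addr >> 5) + (addr & 31)) % base for addr in range(1024)]

def sum_five(outputs: list, rom: list) -> int:
    """Combine five circuit outputs with four ROM look-ups in a chain."""
    total = outputs[0]
    for value in outputs[1:]:
        total = rom[(total << 5) | value]
    return total

rom29 = build_adder_rom(29)
line_outputs = [28, 27, 5, 0, 13]     # one 5x1 circuit output per kernel line
print(sum_five(line_outputs, rom29))
```

A plain binary adder would also need a correction step to reduce the sum modulo the base, which is the reason given for preferring ROMs.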

3.1.3  Decoding 

The last portion of the processor is concerned
with the conversion from the residue representation
to a binary representation. This conversion could
certainly be done the same way as the encoding, by
table look-up, but there is a severe problem with
that approach. For our particular system, to convert
four 5-bit values, the decoder would require a
memory 1 million elements wide with each element
being 17 bits deep. This table is certainly attainable
but the approach is not extendable. If an
extra base is required, so that five 5-bit values
need to be converted, the memory requirements
increase to 33 million elements, each being greater
than 20 bits deep.

There are two conversion methods that do not
require these large memories. One is based on the
Chinese remainder theorem and the other is based on
a mixed radix representation. (For a complete discussion,
refer to Ref. 9.) This paper will focus
only on the particular implementations of these
techniques and the rationale for selecting one over
the other.

To be able to reasonably discuss either of the
two conversion methods some notation must be introduced.
If B is the base vector whose elements are
the bases used for the computations,

$$B = (b_1, b_2, \ldots, b_k),$$

R is a scalar whose value is equal to the dynamic
range of the processor, which is given by

$$R = \prod_{i=1}^{k} b_i,$$

and X is the value we wish to encode into the
residue representation, then RX, the vector whose
elements are the data values for each of the computation
channels, is given by

$$RX = (rx_1, rx_2, \ldots, rx_k),$$

where

$$rx_i = X \bmod b_i, \quad i = 1 \text{ to } k.$$

The Chinese remainder conversion process is based
on the following property. If

$$RX = (rx_1, rx_2, rx_3, rx_4)$$

then

$$RX = [(rx_1, rx_2, 0, 0) + (0, 0, rx_3, rx_4)] \bmod R.$$

Figure 7 shows a system which performs this
conversion and which requires only two blocks of
memory 1,024 elements wide and two adders. The
adders need only be as large as the accuracy
required of the system. Typically, for image
processing systems, the output dynamic range and
the input dynamic range are equal and thus the
adder complexity can be relatively small.
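
The two-table decoding can be illustrated with the standard Chinese remainder construction. The sketch below is an assumption about how the table contents could be generated, not a description of the actual ROM images; it keeps full precision rather than a reduced output word, and all names are hypothetical. It relies on Python 3.8 or later for the modular inverse `pow(m, -1, b)`.

```python
from math import prod

BASES = (31, 29, 23, 19)
R = prod(BASES)

def encode(x: int) -> tuple:
    return tuple(x % b for b in BASES)

def crt_weight(i: int) -> int:
    """Weight that is 1 modulo BASES[i] and 0 modulo every other base."""
    m = R // BASES[i]
    return (m * pow(m, -1, BASES[i])) % R

W = [crt_weight(i) for i in range(len(BASES))]

def build_pair_table(i: int, j: int) -> list:
    """1,024-entry table addressed by two 5-bit residues (a in the upper
    five address bits, b in the lower five)."""
    return [(a * W[i] + b * W[j]) % R for a in range(32) for b in range(32)]

T12 = build_pair_table(0, 1)   # decodes (rx1, rx2, 0, 0)
T34 = build_pair_table(2, 3)   # decodes (0, 0, rx3, rx4)

def decode(rx: tuple) -> int:
    """Two table look-ups followed by one addition modulo R."""
    return (T12[(rx[0] << 5) | rx[1]] + T34[(rx[2] << 5) | rx[3]]) % R

print(decode(encode(12345)))
```

Splitting the residue vector in two is what reduces a single table of about one million entries to two tables of 1,024 entries each.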

Figure 4. Schematic of 5x1 Residue Circuit

Figure 5. Photograph of CRC 181 Layout

Figure 6. Structure of 5x5 Processor
Utilizing 5x1 Processor Circuits

Figure 7. Chinese Remainder Theorem Residue
Decoder (4 Base System, 5 Bits/Base)

Figure 8. Mixed Radix Based Residue Decoder
(4 Base System, 5 Bits/Base)

The second conversion scheme considered is
based on the mixed radix method, but is simplified
by the fact that the output dynamic range can be
approximately eight bits. The method can be
explained by imagining an iterative process where
at every iteration the smallest base is eliminated
by dividing the value by that base. Dividing
essentially reduces the dynamic range of the value
and thus eliminates the need for the extra base.
Of course, since we are limited to a strictly integer
system, we must make sure that the value is
evenly divisible by the smallest base. This can be
done by rounding up or down so that the element in
the residue vector for that base is zero. Figure 8
shows an architecture for a four-base system that
performs this mixed-radix-like conversion. At the
bottom level of this tree structure the fourth base
is eliminated. At the next highest level of the
tree the third base is eliminated. Finally we are
left with two base values that can be decoded with
a simple look-up table. This system has been simulated,
and computer programs exist that can
generate the contents of the ROMs for this conversion
process for an arbitrary set of bases.
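
The base-elimination idea can be sketched in software as follows. This is one possible reading of the scheme, not the authors' ROM contents: it always rounds down (the text allows rounding up or down), it computes each step arithmetically where the hardware would use a look-up table, and all names are hypothetical. It relies on Python 3.8 or later for `pow(b, -1, m)`.

```python
BASES = (31, 29, 23, 19)

def encode(x: int, bases=BASES) -> tuple:
    return tuple(x % b for b in bases)

def eliminate_last_base(residues: tuple, bases: tuple):
    """Round the value down to a multiple of the last base, then divide by
    that base, working only on the residues of the remaining bases."""
    *kept, r_last = residues
    *kept_bases, b_last = bases
    reduced = tuple(
        ((r - r_last) * pow(b_last, -1, b)) % b
        for r, b in zip(kept, kept_bases)
    )
    return reduced, tuple(kept_bases)

# Final look-up: every (mod 31, mod 29) pair maps to a value below 31 * 29.
FINAL_TABLE = {(v % 31, v % 29): v for v in range(31 * 29)}

def scaled_decode(residues: tuple) -> int:
    """Eliminate the fourth and then the third base, then decode the rest."""
    bases = BASES
    while len(bases) > 2:
        residues, bases = eliminate_last_base(residues, bases)
    return FINAL_TABLE[residues]

x = 100000
print(scaled_decode(encode(x)), x // (19 * 23))
```

With rounding down, the decoded value equals the original value divided by 19 x 23 = 437 with the remainder discarded, so the output occupies a much smaller range than the full residue representation.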

We chose the mixed-radix-based conversion
process to be implemented for our processor for two
reasons. First, the method does not require any
logic other than ROMs. This tends to make it more
flexible and reliable. Second, the method appears
to be easily extended to more bases, by simply
extending the decoding tree. Extending the
Chinese-remainder-based process would require either larger
ROMs or more adders, either way being less attractive
than the mixed-radix-type conversion.

3.1.4 Programming and Control

A major problem in the fabrication of this
processor is gaining access to each of the 20
custom residue chips for the purpose of programming
the look-up tables. Each chip has three address
lines to select one of five RAM structures, a read/
write line, the five bi-directional programming
data lines, and a control line for the data line
drivers. These 10 control lines must be brought
out for each of the 20 custom chips for a total of
200 control lines for the purpose of programming.
However, by bussing lines where possible and by
using a peripheral interface chip, the Intel 8255,
the number of lines that are actually brought out
of the processor is reduced to 16.

Figure 9 shows the structure that will be used
to program the processor. The three address lines
and the bus driver control lines are brought from
each custom chip to an 8255. One 8255 is able to
control five of the custom chips, since the 8255
has 24 lines available through three 8-bit ports.
Thus four 8255s are required to control all 20 of
the custom chips. In addition, the five program
data lines and the read/write line are bussed
between each of the chips and these six lines are
brought to a fifth 8255. The eight input data
lines of the 8255s are bussed as well as the
two-bit port select address lines. Finally, we bring
out each of the 5 chip select lines to a binary
decoder, allowing selection of a single 8255 using
three control lines.

To use this structure to program a given RAM
element in the processor requires the following
steps. Initially, the fifth 8255 is selected and the
data to be programmed are written to the port
containing the five program data lines. Next, the
8255 that controls the chip that the desired RAM
element is on is selected, and the code to select
the desired RAM structure is written to the proper
port. Next, the address of the desired RAM element
is provided on the input data line of the processor,
and then the address is shifted so that it is
addressing the proper RAM structure. Finally, the
write data line is strobed to complete the programming
sequence. This sequence can be accomplished
by three 16-bit data transfers. For a processor
using 31, 29, 23, and 19 as the modular bases, a
total of 2,550 RAM elements need to be programmed.
Thus, if three word transfers are required for each
RAM element, a total of 7,650 word transfers are
required to completely program the processor.
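
The counts quoted in this section can be reproduced with a short calculation. The sketch assumes that only the first b entries of each 32-entry RAM need to be programmed for a channel of base b; that assumption is not stated in the text, but it gives the 2,550 figure.

```python
BASES = (31, 29, 23, 19)
CIRCUITS_PER_BASE = 5        # one 5x1 circuit per kernel line
RAMS_PER_CIRCUIT = 5         # one look-up table per shift register element
TRANSFERS_PER_ELEMENT = 3    # 16-bit transfers per programmed RAM element

# Control lines per custom chip: 3 address, 1 read/write,
# 5 programming data, 1 data-line driver control.
lines_per_chip = 3 + 1 + 5 + 1
chips = CIRCUITS_PER_BASE * len(BASES)
total_control_lines = lines_per_chip * chips

# Assumed: a RAM in a base-b channel needs only b entries programmed.
ram_elements = sum(CIRCUITS_PER_BASE * RAMS_PER_CIRCUIT * b for b in BASES)
word_transfers = ram_elements * TRANSFERS_PER_ELEMENT

print(chips, total_control_lines, ram_elements, word_transfers)
```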

The processor that involves all of the functions
described above is currently being fabricated.
The majority of the electronics will be on a single
wirewrap board, but the line delays will be in a
separate box. A picture illustrating the progress
on the wiring of the board is shown in Figure 10.




Figure 9. Structure for Programming and Control
of Residue Processor
Figure 10. Photograph

3.1.5 A VLSI Version of a Residue Computation Chip


Even  though  the  custom  chip  is  performing  the 
bulk  of  the  computations,  some  extra  circuitry  is 
still  required.  Most  of  the  extra  circuitry  is 
necessary  because  we  are  using  a  custom  circuit 
based on a 5x1 kernel. The next step then, once
this processor is tested and demonstrated, is to
develop a 5x5 residue custom circuit and to fabricate
a processor that would utilize these VLSI
circuits.


includes four delay lines, probably 512 elements
long, 25 registers for generating the 5x5 window,
25 RAM structures, 24 modular adders, and some
delay stages. In addition, programming, control,
and testing of this circuit must be considered. A
great deal about these issues can be learned by
using the processor currently being fabricated.

Both the current chip, the 5x1, and the next
chip, the 5x5, have been sized to get a quantitative
measure of their complexity. The CRC 181 has
a device count of approximately 6,500, of which the
RAM portions of the circuit take up 4,500 devices.
For the 5x5 custom circuit the device count will
increase  to  80,000.  One  of  the  reasons  for  the 
high  device  count  is  the  addition  of  the  line 
delays,  which  account  for  50,000  devices  (for 
static  memory  cells).  The  total  number  of  devices 
for  random  logic  is  about  7,000,  which  is  low  con¬ 
sidering  that  the  circuit  will  have  a  throughput  of 
500  million  operations  per  second. 
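
As a rough consistency check on these figures, the sketch below assumes that one "operation" is either a table lookup or a modular addition, and that the circuit accepts one sample every 100 nsec (the input interval quoted for the processor elsewhere in this paper). With 25 lookups and 24 additions per sample this gives 490 million operations per second, close to the quoted 500 million. The second function shows what remains of the 80,000-device budget after the line delays and random logic are subtracted; attributing that remainder to the RAM structures is an inference, not a statement from the text.

```python
def ops_per_second(lookups=25, adds=24, sample_period_ns=100):
    """Operations per second if every sample needs all lookups and adds."""
    samples_per_second = 1_000_000_000 // sample_period_ns
    return (lookups + adds) * samples_per_second


def remaining_devices(total=80_000, line_delays=50_000, random_logic=7_000):
    """Devices left over for the remaining structures (presumably the RAMs)."""
    return total - line_delays - random_logic
```

The remainder of 23,000 devices is close to five times the 4,500 RAM devices of the 5x1 chip, which is what a five-row version of the same RAM structure would suggest.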

As stated, going to a 5x5 circuit will greatly
reduce the extra circuitry required to construct a
5x5 processor. Figure 12 shows a block diagram of
a system utilizing a 5x5 circuit. With the VLSI
circuit the package count for the data flow portion
of the processor will be only 14. This is compared
to a package count in excess of one hundred for the
current design. The power and size will be greatly
reduced, thereby permitting the processor and the
DEC UNIBUS interface to be put on a single card.

3.1.6  Functional  Capabilities 

Although the primary motivation for developing
this processor was that the systems we investigated
required 5x5 convolutions, the processor is capable
of performing a wider range of computations. The
reason for this flexibility is that we used a look-up
table to perform a unary operation, and the
table is completely programmable. The general form
of the computation that can be performed by the
processor is

$$y = \sum_{j=1}^{25} f_j(x_j)$$
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Ideally,  the  5x5  circuit  should  include  the 
circuitry  for  generating  the  five-line  kernel. 

This would mean that there would be five bits for
input and five bits for output. If the kernel generation
is performed off the chip,
25 lines for input would be required, or at least a
high-speed multiplexer would need to be on the chip.
On the other hand, if a simple line delay is used
to generate the five-line kernel, the circuit would
be too inflexible, since it may not be suited to
certain applications. This can be avoided by augmenting
the shift register with logic to control
the clocking. By multiple clocking, the shift registers
can be made to delay any length up to the
maximum length. In other words, we would be constructing
an elastic delay line.

Figure 11 shows a block diagram for the data
flow of a 5x5 residue circuit. This circuit


where y is the output, the x_j represent the 25 elements
in the 5x5 kernel, and the f_j represent polynomial
functions of a single variable. Each f_j is
completely arbitrary and need not have any relation
to the other f_j. By selecting subsets of the f_j to
be identically zero, we can program the processor
to perform point transforms, one-dimensional transforms
of any size up to 5x1, and two-dimensional
transforms of any size up to 5x5. Table 3 lists
some of the functions that can be performed by this
processor.
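
The following is a minimal software sketch of this kind of computation, not the hardware design itself. It assumes that each kernel element has one lookup table per modulus, that a table is addressed by the residue of the pixel value, that the f_j are integer-coefficient polynomials (so the residue of f_j(x) depends only on the residue of x), and that results are decoded with the Chinese remainder theorem and interpreted as signed values. The function names and the signed decoding are illustrative assumptions.

```python
from math import prod

MODULI = (31, 29, 23, 19)   # modular bases quoted in the text
M = prod(MODULI)            # number of distinct representable results


def build_tables(funcs, modulus):
    """table[j][r] = f_j(r) mod modulus, one table per kernel element."""
    return [[f(r) % modulus for r in range(modulus)] for f in funcs]


def residue_channel(window, tables, modulus):
    """Sum the table outputs for one modulus using modular additions only."""
    acc = 0
    for x, table in zip(window, tables):
        acc = (acc + table[x % modulus]) % modulus
    return acc


def crt(residues, moduli=MODULI):
    """Chinese remainder reconstruction of the integer in [0, M)."""
    total = 0
    for r, m in zip(residues, moduli):
        partial = M // m
        total += r * partial * pow(partial, -1, m)   # modular inverse
    return total % M


def make_processor(funcs):
    """Program the tables once, then evaluate y = sum_j f_j(x_j) per window."""
    tables = {m: build_tables(funcs, m) for m in MODULI}

    def process(window):
        residues = [residue_channel(window, tables[m], m) for m in MODULI]
        y = crt(residues)
        # Assumed convention: the upper half of the range encodes negatives.
        return y - M if y > M // 2 else y

    return process
```

Programming the tables with f_j(x) = w_j x gives an ordinary 5x5 convolution with integer weights w_j; setting most f_j to zero gives the point and one-dimensional cases. The result is exact only while the true value of y stays within the range covered by the product of the moduli.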

4. UNIBUS INTERFACE

The processor, as mentioned before, is designed
to accept data at 100 nsec intervals. The reason
for this high-speed design is to allow real-time
stand-alone operation. This means, however, that
when the processor is used as a peripheral device
attached to a general-purpose computer, the data

Figure 11. Data Flow for 5x5 Residue Custom Circuit

Figure 12. Residue Processor Based on
5x5 Custom Chip


TABLE 3. FUNCTIONAL CAPABILITIES OF RADIUS

POINT OPERATIONS
    POLYNOMIAL FUNCTIONS
    CONTRAST ENHANCEMENT

1-DIMENSIONAL OPERATIONS
    INTEGER COEFFICIENT TRANSFORMS
    POLYNOMIAL FUNCTIONS

2-DIMENSIONAL OPERATIONS
    EDGE ENHANCEMENT
    STATISTICAL DIFFERENCING
    LOW PASS/HIGH PASS FILTERING
    SHAPE MOMENT CALCULATIONS
    STATISTICAL MOMENT CALCULATIONS
    INTEGER COEFFICIENT TRANSFORMS
    TEXTURE ANALYSIS

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transfer  will  be  limited  by  the  memory  cycle  of  the 
general-purpose computer and not by the processor
speed. This means that, to get optimal use of the
processor, we need the fastest type of transfer
available between the processor and the main memory,
where the data to be processed will reside. The
direct memory access (DMA) type of transfer is the
fastest type of transfer that a general-purpose
computer can support, since it does not require
processor intervention. For DEC UNIBUS applications,
the fastest data rate one could expect is
approximately 1 MHz.

The type of interface we design should then be
able to provide a DMA transfer capability for both
the programming data and the image data. For either
type of transfer, it is essential that the interface
be controlled to select the memory location from
which the data are to be transferred and to select
the number of words to transfer. For program data
transfers, the interface will only be required to
transfer one way at any time. The transfer will be
to the processor while in a program mode and from
the processor while in a test mode. For image data
transfers the interface must be able to transfer
data both ways, for input and output. The simplest
alternative to handle this bi-directional transfer
of data (from a hardware point of view) is to transfer
the output data to the same memory location the
input data came from, i.e., write the output image
over the input image.

DEC devices exist that can provide the DMA
transfer capability as well as provide several control
lines to the peripheral device to allow multiple
transfer modes. One such device is the DEC
DR11B UNIBUS parallel interface. Our plan is to
use this device to provide the DMA capability and
to design a custom interface to permit the specific
transfer modes. The arrangement suggested is shown
in Figure 13.

The custom interface will need to interpret
the control lines from the DR11B and decide if the
transfer is for program data or image data. If it
is program data, the interface will simply pass the
data to the 16 program data lines. If it is an
image data transfer, then the custom interface is
more complex. Since it is a 16-bit transfer, the
data will contain two pixels. So following the
transfer, the interface must first pass one byte to
the input data lines and then the next byte. Simultaneously,
the interface must load the first output
image data into one byte of the 16-bit output data
register and then the next output data into the
other byte of that register. Finally, the output
data register's contents are transferred to the
DR11B, which writes it to main memory. A preliminary
schematic of a system that can perform these types
of transfers is shown in Figure 14.
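
The byte handling described above can be sketched in software as follows. This is only an illustration of the unpack, process, repack, and overwrite sequence. The byte order (low byte first) and the function name are assumptions, and the pipeline latency of a real 5x5 processor is ignored by using a simple per-pixel operation.

```python
def process_words_in_place(words, pixel_op):
    """Overwrite each 16-bit word with the processed versions of its two pixels.

    words    -- list of 16-bit integers, two 8-bit pixels per word
    pixel_op -- function mapping one 8-bit pixel to one 8-bit result
    """
    for i, word in enumerate(words):
        first = word & 0xFF            # assumed: low byte is sent first
        second = (word >> 8) & 0xFF    # then the high byte
        out_first = pixel_op(first) & 0xFF
        out_second = pixel_op(second) & 0xFF
        # Both results are packed into one output word, which replaces the
        # input word at the same memory location.
        words[i] = out_first | (out_second << 8)
    return words
```

Writing the output over the input means a single address and word count serve both directions of the transfer, which is the hardware simplification the text points to.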

5. SUMMARY AND FUTURE WORK

We have described the work undertaken to design
VLSI processors for these widely used systems:
line-finding, texture classification, and segmentation.
From this work, we believe we can, if
required, build the necessary hardware. However, of
greater impact, we have identified and started to
build a fully software-programmable low-level processor
for 5x5 operations. The circuitry described
relies on a special-purpose VLSI chip with 6,500
components. Using this, and the interface designed
to hook the processor to commercial general-purpose
machines, most low-level arithmetic operations over
a 5x5 kernel can be performed. This work will continue
and the full system will be demonstrated
using our in-house PDP 11/34. We then anticipate
making this available to interested government
researchers in this field.

Our future plans include the investigation and
possible development of a single ultra-high-density
circuit to include the full processor and the
development of a compatible logic processor.

Figure 13. Commercial/Custom
UNIBUS-Processor Interface

Figure 14. Bus Structure of UNIBUS-Processor Interface
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Image  Understanding  Research  at  CMU 


Takeo Kanade and Raj Reddy


Computer Science Department
Carnegie-Mellon University
Pittsburgh, PA 15213


Image Understanding Research at CMU mainly concerns three
research areas: theory for understanding 3-dimensional shapes;
integrated system demonstration for photo interpretation
(database and interactive/automatic image interpretation
techniques); and special devices and computer systems for image
understanding. In this report we will present the CMU views and
our recent representative progress for the first two areas.


Theory  for  Shape  Understanding 

Images  convey  the  shape  information  in  a  very  complicated 
manner.  Many  factors  are  interwoven  into  the  imaging  process: 
surface  property,  surface  orientation,  texture,  class  of  objects,  etc. 
They  individually  provide  constraints  on  the  shapes  that  the  image 
depicts.  Historically,  most  vision  programs  have  used  these 
constraints  in  a  vague,  unformalized  manner. 

At CMU we believe that it is an important challenge in image
understanding to formulate theories of shape understanding from
images: what constraints are provided by individual properties
which are observable in the image, and how they can be
aggregated into consistent shape interpretation. We have been
actively investigating geometrical aspects of image constraints for
extracting shape from static, monocular images:

•  Shape recovery from line drawings [3] [4]

•  Shape from Texture paradigm [5]

•  Mapping image properties into shape constraints [2]

We will review our recent progress in the last two topics,
together with the research on 3-D shape sensing and analysis.


Shape  from  Texture 

Kender finished his Ph.D. thesis on Shape-from-Texture [5]. Not
only did the research produce many interesting results on texture
analysis, but Shape-from-Texture also provides a computational
paradigm. In the image forming process, surfaces are
perspectively projected onto two-dimensional regions of the
imaging retina. The images of the texture constituents which
define a surface are distorted by local surface orientation, relative
surface distance, and the characteristics of the imaging device.
The task of the visual processor is to deconvolute these effects.
Recovery of the scene characteristics depends on simplifying
assumptions about the physical world. These are the notions of
texture regularity, and of surface opacity and smoothness. The
general paradigm exploits each of these assumptions in the
creation and refinement of the analytic framework.

The fundamental conceptual and representational tool is the
normalized textural property map (NTPM). Intuitively, this map
relates a given two-dimensional image texel (texture element) to
the small class of three-dimensional constituents which may have
been its source in the scene. More precisely, it is a way of
deprojecting the effects that surface orientation has on primitive
textural properties such as slope in the image, length of major axis
of elongation, etc. The map summarizes the answers to the
question, "what would the textural property (e.g., real slope, real
length, etc.) have had to be in the scene in order to observe the
given textural property in the image?".

The NTPMs for more than one primitive texel are combined or
intersected to derive a specific surface orientation. Here,
additional heuristics about the physical world are invoked.
Typically, they are the continuity of constituents (regularity), the
continuity of local surface orientations (smoothness), and
repetition of identical texel constituents (structural texture).
Figure 1 is an example of recovering the surface orientation of a
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Figure  1 :  Recovery  of  a  surface  orientation  of  a  building  face 

(a) A fragment of a digitized image of a building face. The surface orientation is approximately (p,q) = (0.[illegible],0).

(b) The edge image.

(c) The gradient space accumulator array (portion near (p,q) = (0,0)). The surface orientation of the building face is given as
the intersection of the two virtual lines: one horizontal, the other diagonal. They respectively stem from the vertical parallel
lines and the diagonal parallel lines in the building face.


building face: it uses image slopes of texels and an assumption
that near-parallel image lines are parallel in the scene.

Following  are  some  of  the  most  illustrative  results  of  Kender's 
thesis: 


•  Since most images are taken in an environment pervaded by a
force that strongly orients objects with respect to it, certain
heuristics regarding "up", "horizontal", and other gravity-based
terms make image understanding simpler. For example,
under perspective, assuming that a certain direction (usually,
the y direction, through the center of the image) is vertical
enables the orientation of the assumed ground plane to be
determined by using a single near-vertical texel.

The Paradigm Applied to Length and Spacing

•  The paradigm is applicable directly to image texels that have
linear measure. Actual line elements, or virtual line elements
(spacings), are analyzed in identical ways.

•  The general problem of arbitrary line lengths is tractable in all
cases under orthography, but only under special cases under
perspective. The rotational coupling of the gradient space
ensures that the orthographic NTPM has only one free
parameter (the image length). However, the perspective NTPM
has four: three for position, and one for scale.

•  Under orthography, the assumption of equal length in the
scene reduces to the case of the Kanade orientation hyperbola.

•  Under perspective, the assumption of actual scene length plus
the assumption of parallelism in the scene induces a simple
graphic construction that directly gives the vanishing line, and
therefore the surface orientation.

•  Under perspective, the assumption of equal scene length plus
the image phenomenon of colinearity create a one-dimensional,
straight-line gradient space constraint.

•  The colinear-equal assumption gives a unique vanishing point
that can be determined exactly by using a Hough-like
accumulator method that generates parabolas or hyperbolas in
a transform space.

The Paradigm Applied to Area and Density

•  The NTPM of density under orthography is identical to the
reflectance map of a Lambertian surface. This gives strong
theoretic support to the popular method of blurring textures
into shades of grey for scene analysis purposes.

•  Under perspective, the blurring of textures is hazardous, as it is
crucially dependent on second order distribution statistics.

The Paradigm Applied to Intensity and Contour

•  Texture and illumination share many parallels. Most often, the
two phenomena (shading and pattern) are intermingled. Of the
two, texture appears more "robust".

•  Shading can also be analyzed under perspective. The simplest
case of shading (the Lambertian sphere) has the same analysis
under perspective as it does under orthography.

•  The constraint curves generated by some shape-from-occluding-contour
methods are identical to both those used in
shape-from-shading and those used in shape-from-texture. All
three have the same, simple graphic interpretation.

The Gaussian Sphere As the Preferred Representation

•  The gradient space is deficient in that it is only half a space; the
Gaussian sphere is a natural extension with a number of
preferable properties [6]. However, the surface of a sphere is
hard to handle in a straightforward way in a planar computer.

•  The Gaussian sphere preserves many of the pleasing
properties of the gradient space.

•  Other representations may be occasionally useful: the inverted
gradient space ((1/p,1/q) instead of (p,q)), for example.
Especially intriguing is the problem of how to simplify several
half-angle formulae.


Mapping  image  properties  into  shape 
constraints 

Certain image properties, such as parallelisms, symmetries, and
repeated patterns, provide cues for perceiving 3-D shape from a
2-D picture. Kanade and Kender [3] demonstrated how we can map
those image properties into 3-D shape constraints by associating
appropriate assumptions with them and by using appropriate
computational and representational tools.

Skewed  Symmetry 

One representative example is the concept of skewed
symmetry, a property of 2-D shapes in which the symmetry is
found along lines not necessarily perpendicular to the axis of
symmetry, but at a fixed angle to it. Formally, such shapes can be
defined as 2-D affine transforms of real symmetries. (Figures 2
(a)(b)(c) show a few examples.) There is a good body of
psychological experiments [12] which suggests that human
observers can perceive surface orientations from figures with this
property. This is probably because such qualitative symmetry in
the image is often due to real symmetry in the scene. Let us
associate the following assumption with this image property: "A
skewed symmetry depicts a real symmetry viewed from some
unknown view angle."

We can map this assumption into the constraints on surface
orientation. Let G = (p,q) denote the gradient of the plane which
includes the skewed symmetry, and α and β denote the 2-D
directions of the skewed symmetry's axes. It can be shown that
such a skewed symmetry in the picture can be a projection of a
real symmetry if and only if the gradient is on the hyperbola shown
in Figure 3, which is given by:

$$\cos(\alpha - \beta) + (p\cos\alpha + q\sin\alpha)(p\cos\beta + q\sin\beta) = 0$$
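
The constraint can be checked numerically with a small sketch. It assumes orthographic projection and a plane written as -z = px + qy, so that a unit image direction at angle a back-projects to the 3-D vector (cos a, sin a, -(p cos a + q sin a)). Requiring the two back-projected axes to be perpendicular then gives the hyperbola above. The function names are illustrative, not taken from the cited work.

```python
import math


def skew_constraint(p, q, alpha, beta):
    """Left-hand side of the skewed-symmetry hyperbola; zero on the curve."""
    a = p * math.cos(alpha) + q * math.sin(alpha)
    b = p * math.cos(beta) + q * math.sin(beta)
    return math.cos(alpha - beta) + a * b


def backproject(angle, p, q):
    """3-D vector on the plane -z = p*x + q*y whose orthographic image
    is the unit 2-D direction at the given angle."""
    dx, dy = math.cos(angle), math.sin(angle)
    return (dx, dy, -(p * dx + q * dy))


def gradient_on_hyperbola(alpha, beta, theta):
    """Gradient along direction theta that satisfies the constraint,
    or None when the ray at theta does not meet the hyperbola."""
    denom = math.cos(alpha - theta) * math.cos(beta - theta)
    if denom == 0.0:
        return None
    t_squared = -math.cos(alpha - beta) / denom
    if t_squared < 0.0:
        return None
    t = math.sqrt(t_squared)
    return (t * math.cos(theta), t * math.sin(theta))
```

For example, with α = 0°, β = 60°, and θ = 120° (the bisector of the obtuse angle between the axes), the recovered gradient has magnitude √2, and the two back-projected axes are perpendicular in the scene.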

Kanade [4] has shown that this constraint plays important roles
in perceiving quantitative shapes from line drawings of polyhedra.
This quantitative shape recovery is an essential advance from
Huffman-Clowes-Waltz type labeling, which could only
characterize shapes of polyhedra qualitatively (convex or concave
edges, etc.).

Affine-Transformable Patterns

A more general class of image properties that Kanade and
Kender [2] have found is affine-transformable patterns. In texture
analysis we often consider small patterns (texels) by whose
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Figure  2:  Skewed  symmetry 

A skewed symmetry defines two directions in the image: skewed symmetry axis
(α) and skewed symmetry transverse axis (β). The skewed symmetry assumption
assumes that they are perpendicular in the scene.


Figure 3: The hyperbola determined by a skewed symmetry

The axis of the hyperbola is the bisector of the obtuse angle made by α and β.
The asymptotes make the same angle as the acute angle made by α and β.


repetition a texture is defined. Suppose we have a pair of texel
patterns in which one is a 2-D affine transform of the other; we call
them a pair of affine-transformable patterns. Let us assume that
"A pair of affine-transformable patterns in the picture are the
projection of similar patterns in the 3-D space (i.e., they can be
overlapped by scale change, rotation, and translation)". The
above assumption can be schematized by Figure 4.


Then we can derive a relationship between the gradients G1 and
G2 of the scene constituents of the texel. This constraint is
determined solely by the matrix A, which is determined by the
relation between P1 and P2 which is observable in the picture,
without knowing either the original patterns (P'1 and P'2) or their
relationships (σ and R) in the space.

The constraints by the affine-transformable patterns can be
used for shape recovery of an object, like Figure 5, on whose
surface a number of patterns are printed or stamped, such as
textile or wallpaper. The assumptions we used for the skewed
symmetry, the affine-transformable patterns, and texture analysis
can be generalized as: "Properties observable in the picture are
not by accident, but are projections of some preferred
corresponding 3-D properties." This provides a useful meta-heuristic
for exploiting image properties: we can call it the meta-heuristic
of non-accidental image properties. Methods of
aggregating many local constraints into consistent shapes are
being further studied.

3-D  shape  sensing  and  analysis 

At CMU, optical noncontact ranging devices for medium (50 cm)
and short (5 cm) range are being developed [1]. Both use analog
position sensor chips to detect the location of an intensity spot
which is projected on the object surface by a light source. Unlike
vidicon or CCD array sensors, scanning the field of view is not
necessary. The chip outputs currents which, with a
computation, tell the location of the spot. This can provide simple,
fast, accurate, noncontact visual sensing of range information.
Prototype devices were constructed for both medium and
proximity ranging devices. They are being tested presently and
will be used mainly for robotic applications. We expect the
performance of 10,000 points/sec. and 1,000 line resolution. In
parallel with the development of ranging devices, we are
developing programs to reliably extract local surface orientations
from range data. Based on the theoretical results of Kender, we
are looking at the Gaussian sphere representation for object
surface orientations using the sinusoidal mapping of a sphere to a
2-D plane.

Figure 4: A schematic diagram showing the assumption on the affine-transformable patterns

Consider two texel patterns P1 and P2 in the picture, and place the origins of the x-y coordinates at their centers,
respectively. The transform from P2 to P1 can be expressed by a regular 2x2 matrix A = (a_ij). P1 and P2 are projections of
patterns P'1 and P'2 which are drawn on the 3-D surfaces. We assume that P'1 and P'2 are small enough so that we can
regard them as being drawn on small planes. Let us denote the gradients of those small planes by G1 = (p1,q1) and
G2 = (p2,q2), respectively; i.e., P'1 is drawn on a plane -z = p1 x + q1 y and P'2 on -z = p2 x + q2 y.
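
The sinusoidal mapping of the Gaussian sphere mentioned in the text on 3-D shape sensing can be sketched as follows. This assumes the standard sinusoidal (equal-area) projection, x = longitude x cos(latitude), y = latitude, with the pole placed along the image y axis and longitude zero for a surface facing the viewer. The axis convention and function names are illustrative assumptions, not details given in the text.

```python
import math


def normal_from_gradient(p, q):
    """Unit normal of the plane -z = p*x + q*y."""
    norm = math.sqrt(p * p + q * q + 1.0)
    return (p / norm, q / norm, 1.0 / norm)


def sinusoidal_map(normal):
    """Map a unit surface normal to a point in the 2-D sinusoidal plane."""
    nx, ny, nz = normal
    lat = math.asin(max(-1.0, min(1.0, ny)))   # pole along y (assumed)
    lon = math.atan2(nx, nz)                   # zero when facing the viewer
    return (lon * math.cos(lat), lat)
```

Unlike the gradient space, which covers only orientations with a component toward the viewer, this map has a finite position for every orientation on the sphere, including normals that point away from the viewer.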
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•  Terrain  Data 


Integrated  System  of 
Database  and  Photo  Interpretation 


o  DLMS Digital Terrain Data File: 100 m x 100 m/pixel, 1 m
resolution in altitude


A system for photo interpretation tasks needs more than simple
image interpretation techniques. Newly acquired images have to
be assimilated into the system, compared with existing images and
maps, and interpreted by using the existing relevant information or
knowledge. The extracted information then updates the database.
At CMU, we are currently developing a demonstration system
MAPS which integrates image database, map representation,
interactive image manipulation and display, and automatic image
analysis [9] [8]. Figure 6 shows the present configuration of
MAPS.


Multi-sensor, multi-data-type image
database

We currently have a multi-data-type, multi-sensor image database
for the Washington D.C. area bounded by <North 38°, West 76°>
and <North 39°, West 78°>. It includes:

•  Sensory Images

o  airborne monochromatic (BW) 2048x2048

o  Skylab color (COL) 2300x2300

o  airborne color infrared (CIR) 2200x2200

o  Landsat multispectral (MSS) 3200x2400

Figure 5: A picture of a ball with identical patterns stamped


•  Map Data

o  DLMS Cultural Feature Data File: approx. 18,000 features
(point, linear, area)

•  Symbolic Map Representation

o  We are developing a symbolic representation of map (2D
and 3D): interconnection of roads, buildings (shape,
height), etc.


Manipulation, generation and display of
images

Over the past year, we have implemented a large number of
facilities to manipulate, to generate, and to display images using
the database. The representative capabilities are (see [8] for
details):

Browse:

This is a very general, flexible multi-window image display
using a Grinnell color display.

Hand Segmentation:

An image can be segmented interactively, and the results are
stored as part of the image description file.

Terrain Data Manipulation and Display:

Digital terrain data can be manipulated and displayed to show
3-D representation, contours and 3-D features.

Landmark Extraction:

The landmark file contains names, descriptions and image
chips of a large number of landmarks within the task area.
Knowing grossly the area that a new image covers, a user can
quickly locate landmarks in it in order to establish a
correspondence between image and map.

Correspondence Coefficients Computation:

This program computes coefficients of the first, second and
third order polynomial transforms between map and image.
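
As an illustration of what such a computation involves, the sketch below fits polynomial transform coefficients to landmark correspondences by least squares. The fitting method, the monomial ordering, and all function names are assumptions made for the example. The text does not say how the MAPS program computes its coefficients.

```python
def solve(a, b):
    """Solve the square system a x = b by Gauss-Jordan elimination
    with partial pivoting."""
    n = len(b)
    aug = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(n):
            if r != col:
                factor = aug[r][col] / aug[col][col]
                aug[r] = [v - factor * w for v, w in zip(aug[r], aug[col])]
    return [aug[i][n] / aug[i][i] for i in range(n)]


def basis(x, y, order):
    """Monomials x**i * y**j with i + j <= order."""
    return [x ** i * y ** j
            for i in range(order + 1) for j in range(order + 1 - i)]


def fit_transform(map_pts, image_pts, order=1):
    """Least-squares coefficients mapping map (x, y) to image (u, v)."""
    rows = [basis(x, y, order) for x, y in map_pts]
    n = len(rows[0])
    ata = [[sum(r[i] * r[j] for r in rows) for j in range(n)]
           for i in range(n)]
    coeffs = []
    for k in (0, 1):   # k = 0 fits u, k = 1 fits v
        atb = [sum(r[i] * pt[k] for r, pt in zip(rows, image_pts))
               for i in range(n)]
        coeffs.append(solve(ata, atb))
    return coeffs


def apply_transform(coeffs, x, y, order=1):
    """Evaluate the fitted transform at map position (x, y)."""
    b = basis(x, y, order)
    return tuple(sum(c * v for c, v in zip(cs, b)) for cs in coeffs)
```

A first-order fit needs at least 3 landmarks that are not collinear, a second-order fit at least 6, and a third-order fit at least 10, since these are the numbers of monomials in each basis.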

Intervisibility  Map: 

This is being developed as one of the planned CMU
contributions to the Testbed. Moravec [10] is developing a
program which will be able to rapidly generate views of three-dimensional
scenes described by large numbers of planar
faces. The techniques used can represent the effect of
shadows and produce intervisibility maps as special cases.

His method relies on a new hidden surface removal algorithm
of Fuchs which generates views in linear time. (A somewhat
more expensive view-independent presort must be done just
once for the scene.) It generates intermediate images which
contain high resolution views of the scene, but which only
indicate which face is visible at each pixel, not its brightness.
Because these face identity pictures consist of large
contiguous regions with the same value, they can be
represented very efficiently in quad tree structures which
recursively subdivide along edges, but represent large
constant areas as single nodes. The cost of both computing
and storing the face identity quad trees grows as O(n log n)
with the picture resolution (and edge length), allowing very
high resolutions; the cost of a conventional raster grows
quadratically with the resolution.
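
The storage argument can be illustrated with a minimal region quad tree. The sketch assumes a square picture whose side is a power of two and whose pixels hold integer face identities. It is a naive construction meant only to show how the node count behaves, not the algorithm referred to in the text, and its uniformity test makes building slower than the cost quoted above.

```python
def build_quadtree(image, x=0, y=0, size=None):
    """Return a leaf value for a uniform block, else a list of 4 subtrees."""
    if size is None:
        size = len(image)
    first = image[y][x]
    if all(image[y + dy][x + dx] == first
           for dy in range(size) for dx in range(size)):
        return first                      # large constant area: single node
    half = size // 2
    return [build_quadtree(image, x + ox, y + oy, half)
            for oy in (0, half) for ox in (0, half)]


def count_nodes(node):
    """Total number of nodes (internal nodes plus leaves)."""
    if isinstance(node, list):
        return 1 + sum(count_nodes(child) for child in node)
    return 1


def diagonal_picture(n):
    """Two faces separated by a diagonal edge across an n x n picture."""
    return [[1 if x > y else 0 for x in range(n)] for y in range(n)]
```

For the diagonal test picture the node count is 4n - 3, so doubling the resolution roughly doubles the tree while the raster grows fourfold.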

Image  analysis  and  Interpretation 

Segmentation

A region segmentation program [11] has been reimplemented in
the C language on a VAX 11/780 under UNIX. This is also one of
our Testbed contributions and will soon be delivered to SRI. This
segmentation program has all the features of the Ohlander-Price-Shafer
segmentation method. In addition, it will include enriched
capabilities for manual/automatic image analysis (region analysis,
region representation, histogram evaluation, threshold selection).

Image Registration and Stereo Analysis

Lucas [7] presents an iterative image registration technique, a
type of Newton-Raphson iteration, that uses spatial intensity
gradient information to direct the search for the position of the
best match. This technique takes advantage of the fact that in
many applications the two images are already in approximate
registration. It can be generalized to deal with arbitrary linear
distortions of the image, including rotation.

An analysis of convergence suggests that convergence is
obtained if the initial estimate is within a distance of one half of the
size of the object. The range of convergence can be expanded by
first smoothing the image. In fact, since frequency-limited images
(low-pass or band-pass filtered) can be sampled at lower
resolution, we can adopt a coarse-fine strategy together with this
iterative registration.
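
A one-dimensional sketch of gradient-directed registration is given below. It uses one common form of the update, in which the current shift estimate h is corrected by the sum of slope times intensity difference divided by the sum of squared slopes. Treating the signals as functions, estimating the slope by central differences, and the function name are assumptions made for illustration. This is not presented as the exact formulation of [7].

```python
import math


def register_1d(f, g, xs, h0=0.0, iterations=10, eps=1e-3):
    """Estimate the shift h such that g(x) is close to f(x + h) on xs."""
    h = h0
    for _ in range(iterations):
        numerator = 0.0
        denominator = 0.0
        for x in xs:
            # Spatial intensity gradient of f at the currently shifted position.
            slope = (f(x + h + eps) - f(x + h - eps)) / (2.0 * eps)
            numerator += slope * (g(x) - f(x + h))
            denominator += slope * slope
        if denominator == 0.0:
            break                      # no gradient information to use
        h += numerator / denominator   # Newton-Raphson style correction
    return h
```

With a smooth bump as the signal and a true shift that is small compared to the width of the bump, the estimate converges in a few iterations. For larger shifts the same routine can be run first on smoothed, coarsely sampled signals and then refined, which is the coarse-fine strategy described above.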

This iterative registration algorithm with the coarse-fine strategy
was applied to the stereo matching problem. Using the same
principle, iterative formulas were derived for both the disparity
and the camera model. They were tested with real images.

Feature  Extraction 

Automatic stereo matching and interpretation of urban scenes
include difficult and new problems. In stereo, for example, large
discontinuities of disparity must be explicitly taken into account.
For this, it is important first to extract cultural features, such as
edges and corners of buildings. Stereo matching can be done
using those features. Photographs of urban areas have common
characteristics: they include a large number of linear features;
there are a few dominant orientations of the linear features due to
the fact that buildings and roads are mostly aligned and
rectangular; and the features tend to be small and cluttered (e.g.,
two parallel lines can be very close).

Figure 6: Present configuration of MAPS

To increase the accuracy of edge operators, their window size
could be enlarged, but this often results in failure to detect tiny
features. The deviation of the edge orientations detected by a 3x3
Sobel operator from the true orientations was carefully analyzed. The
result is shown in Figure 7. It illustrates how the operator gives
better accuracy when applied to edges aligned with the picture
axes. However, those directions are not known beforehand, and
vary from part to part of the image.
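
The kind of error analysis described above can be sketched as follows. This is an illustrative reconstruction, not the authors' procedure: the supersampled step-edge model, the coordinate convention (x to the right, y upward, window letters A to I read row by row from the top left), and the function names are all assumptions.

```python
import math

def step_edge_window(theta, rho=0.0, n=16):
    """3x3 window of an ideal step edge, area-sampled with n x n sub-samples.

    A point (x, y) is bright when x*cos(theta) + y*sin(theta) >= rho,
    so theta is the direction of the edge normal.
    """
    c, s = math.cos(theta), math.sin(theta)
    window = {}
    for py in (-1, 0, 1):
        for px in (-1, 0, 1):
            bright = 0
            for j in range(n):
                for i in range(n):
                    x = px - 0.5 + (i + 0.5) / n
                    y = py - 0.5 + (j + 0.5) / n
                    if x * c + y * s >= rho:
                        bright += 1
            window[(px, py)] = bright / (n * n)
    return window

def sobel_orientation(w):
    """Orientation of the 3x3 Sobel gradient; w is keyed by (x, y) offsets."""
    # Right column minus left column: (C + 2F + I) - (A + 2D + G).
    dx = (w[(1, 1)] + 2 * w[(1, 0)] + w[(1, -1)]) \
        - (w[(-1, 1)] + 2 * w[(-1, 0)] + w[(-1, -1)])
    # Top row minus bottom row (assumed sign convention, y upward).
    dy = (w[(-1, 1)] + 2 * w[(0, 1)] + w[(1, 1)]) \
        - (w[(-1, -1)] + 2 * w[(0, -1)] + w[(1, -1)])
    return math.atan2(dy, dx)

def orientation_error(theta, rho=0.0):
    """Measured minus true orientation, in radians."""
    return sobel_orientation(step_edge_window(theta, rho)) - theta

for degrees in (0, 10, 20, 30, 45):
    print(degrees, math.degrees(orientation_error(math.radians(degrees))))
```

Sweeping `theta` and `rho` over a grid and plotting the absolute error would give a display comparable in spirit to the one the text attributes to Figure 7.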

First, the image is divided into overlapping small areas (about
100x100). The $\theta$ histogram of the Hough transform on each small
area detects the dominant orientations of edges in that area. For each
orientation, we recompute a more precise edge orientation by
locally rotating the image. Then, the $\rho$ histogram is taken using
the revised $\theta$'s to separate individual line features. This method
allows the extraction of very close parallel lines, such as the case
in which an upper roof edge almost overlaps a lower roof edge.
We are continuing this effort toward precise extraction of
important micro features in urban scenes. Such an extraction
procedure will form a basis for stereo matching and interpretation
of urban scenes.
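
A minimal sketch of the two-stage idea (dominant orientation first, then a histogram of line offsets at that orientation) is given below. It is only an approximation of what the text describes: the scoring rule (sum of squared Hough cell counts per orientation), the bin sizes, and the function names are assumptions, and the local-rotation refinement step is omitted.

```python
import math
from collections import Counter

def dominant_orientation(points, n_theta=180, rho_bin=0.5):
    """Index k of the orientation theta = pi*k/n_theta with the peakiest Hough column."""
    best_k, best_score = 0, -1
    for k in range(n_theta):
        theta = math.pi * k / n_theta
        c, s = math.cos(theta), math.sin(theta)
        # One column of the Hough accumulator: votes per quantized rho.
        column = Counter(round((x * c + y * s) / rho_bin) for x, y in points)
        # Assumed score: collinear points pile up in few cells, so squares reward them.
        score = sum(v * v for v in column.values())
        if score > best_score:
            best_k, best_score = k, score
    return best_k

def rho_histogram(points, k, n_theta=180, rho_bin=0.5):
    """Histogram of rho = x*cos(theta) + y*sin(theta) at the chosen orientation."""
    theta = math.pi * k / n_theta
    c, s = math.cos(theta), math.sin(theta)
    return Counter(round((x * c + y * s) / rho_bin) * rho_bin for x, y in points)

# Two parallel vertical lines only one pixel apart.
points = [(2, y) for y in range(20)] + [(3, y) for y in range(20)]
k = dominant_orientation(points)
hist = rho_histogram(points, k)
print(k, sorted(hist.items()))
```

In this toy case the two lines fall into separate $\rho$ bins once the orientation is fixed, which is the property the text relies on to separate nearly overlapping roof edges.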


$\rho = x\cos\theta + y\sin\theta$

$\Delta x = (C + 2F + I) - (A + 2D + G)$

Figure 7: Analysis of the error in the edge orientation
measured by a 3x3 Sobel operator

A simple step edge is assumed to pass through the central pixel. The display
shows how the error $|\Delta\theta|$ depends on $\rho$ and $\theta$. Naturally, it is small
when the edge passes through the three horizontal (or vertical) pixels.

Conclusion

At CMU we are continuing to develop theories of shape
understanding. As well as addressing basic theoretical issues, we expect that
our robotic applications will stimulate the practical aspects of the
theory. We expect to continually extend our integrated database-
supported photo interpretation system. Demonstration of how
symbolic map representation can successfully interact with aerial
photo interpretation (map-guided interpretation and map updates)
is one of our main goals. For the Testbed activities, the region
segmentation program and the fast generation of intervisibility
maps will soon be deliverable, and implementation of Moravec's
stereo on the VAX will begin.

ABSTRACT

Current activities on the project are reviewed
under the following headings:

1)  Segmentation 

2)  Local  feature  detection 

3)  Feature  linking 

4)  Hierarchical  representation 


1.  INTRODUCTION 

This project is concerned with the study of
advanced techniques for the analysis of reconnaissance
imagery. It is being conducted under Contract
DAAG-53-76-C-0138 (DARPA Order 3206), monitored
by the U.S. Army Night Vision and Electro-Optics
Laboratory, Ft. Belvoir, VA (Dr. George
Jones). The Westinghouse Systems Development
Division, under a subcontract, is collaborating
on implementation and application aspects.

The previous phase of the project, entitled
"Image Understanding Using Overlays", was concluded
during the past reporting period. Accomplishments
under this phase are summarized in a Final
Report dated May 1980 [1], which also contains a
bibliography of all reports and papers produced
during this period.


The current phase of the project is concerned
with three principal areas: (a) comparative
analysis of segmentation techniques applied to
FLIR imagery; (b) development of an inference-based
approach to target detection on FLIR imagery;
and (c) optical flow analysis of time-varying imagery.
Work in area (b) is in progress and will be
described in forthcoming technical reports. Area
(a) has emphasized methods based on hierarchical
("pyramid") image representations, some of which
are reviewed in this report and in two separate
papers in these Proceedings [2,3]. Other separate
papers [4,5] deal with some of the work done
in area (c). In addition, the project is preparing
software contributions to the DARPA/DMA Image
Understanding Testbed; the first of these will be
a general-purpose software package for implementing
relaxation processes at the pixel level.

This report reviews activities on the project
during the period April 1980-January 1981. This
work is covered under the headings of segmentation;
local feature detection; feature linking; and hierarchical
representation. The work is summarized
only briefly, since it is covered in greater detail
in individual technical reports and Image Understanding
Workshop papers.

2.  SEGMENTATION 

2.1 Color pixel classification

When pixels in a black-and-white image are
classified by thresholding their gray levels, gradient
magnitude information can be used in various
ways as an aid in threshold selection. In particular,
a histogram of the gray levels of pixels
whose gradient magnitudes are low has sharper peaks
and deeper valleys than the histogram of the entire
image, since the low-gradient pixels tend to come
from the interiors of regions, not from region border
zones; it is easier to choose useful thresholds
(at valley bottoms) from this improved histogram.
Analogously, when pixels in a color or multispectral
image are classified on the basis of their
spectral signatures, the color gradient magnitude
can be used as an aid in defining decision surfaces
that separate clusters of pixels having like signatures.
In fact, a scatterplot of the signatures
of pixels whose color gradient magnitudes are low
has more clearly separated clusters than the scatterplot
of the entire image, for the same reason
as in the grayscale case. This phenomenon is
illustrated in Figure 1. For further details and
additional examples, see [6].
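
The grayscale case can be illustrated with a short Python sketch. The image, the simple central-difference gradient, and the name `low_gradient_histogram` are assumptions made for the example; the reports cited above may use a different gradient operator.

```python
from collections import Counter

def low_gradient_histogram(image, max_gradient):
    """Histogram of gray levels over interior pixels with small gradient magnitude."""
    hist = Counter()
    rows, cols = len(image), len(image[0])
    for r in range(1, rows - 1):
        for c in range(1, cols - 1):
            gx = image[r][c + 1] - image[r][c - 1]
            gy = image[r + 1][c] - image[r - 1][c]
            # |gx| + |gy| is used here as a cheap gradient magnitude.
            if abs(gx) + abs(gy) <= max_gradient:
                hist[image[r][c]] += 1
    return hist

# Two flat regions (levels 10 and 50) separated by a blurred border zone.
row = [10] * 6 + [20, 30, 40] + [50] * 6
image = [row[:] for _ in range(8)]

full = Counter(value for line in image for value in line)
interior = low_gradient_histogram(image, max_gradient=0)
print(sorted(full.items()))
print(sorted(interior.items()))
```

The full histogram contains the intermediate border levels that fill in the valley, while the low-gradient histogram retains only the two region levels, so any threshold between them separates the regions.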

2.2 Mosaicking

When aerial photographs are combined into a
photomosaic, seams are often apparent between the
parts. These seams are caused by gray level differences
due to the different conditions under
which the parts were recorded. A relaxation method
has been developed that generates a gray-level
correction function such that, when this
function is subtracted from the mosaic, the seams
are eliminated, but the details of the photographs
are not affected. The algorithm does not
assume any specific types of gray level differences
among the parts, nor does it require the
existence of overlaps between the parts, and it
can be used for arbitrary numbers of parts; but
it does have the drawback that if a seam coincides
with an edge between two regions, that edge will
be eliminated. The algorithm constructs a seam-eliminating
function which, when subtracted from
the mosaic, causes the gray levels at pairs of
adjacent points on opposite sides of a seam to
become equal, and which otherwise is as smooth
as possible. An example of mosaic seam elimination
using this algorithm is shown in Figure 2.
Other examples, and further details, can be found
in [7].


3.  LOCAL  FEATURE  DETECTION 

3.1 Higher-order edge detectors

One way to define edge detectors for digital
images is to fit a polynomial surface to a neighborhood
of each pixel, and take the magnitude of
the gradient of that surface as an estimate of
edgeness. The polynomial fitting process is
usually carried out for symmetric neighborhoods,
using polynomials of degree 1 or 2. Using least
squares fitting by orthonormalization and
Richardson extrapolation, one can calculate such
edge estimates for other classes of neighborhoods,
and for higher-order polynomial models [8].
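
For the simplest case mentioned above (a degree-1 polynomial on a symmetric 3x3 neighborhood), the least-squares fit has a closed form, sketched below. The function name and the window layout are assumptions; the higher-order and non-symmetric cases treated in [8] are not covered by this sketch.

```python
import math

def plane_fit_gradient(window):
    """Fit f(x, y) ~ a*x + b*y + c to a 3x3 window by least squares.

    window[r][c] holds the gray level at y = r - 1, x = c - 1.
    Because the neighborhood is symmetric, the normal equations decouple:
    a = sum(x*f) / sum(x*x) and b = sum(y*f) / sum(y*y), with both sums equal to 6.
    """
    a = 0.0
    b = 0.0
    for r, y in enumerate((-1, 0, 1)):
        for c, x in enumerate((-1, 0, 1)):
            a += x * window[r][c]
            b += y * window[r][c]
    a /= 6.0
    b /= 6.0
    # Gradient magnitude of the fitted plane, used as the edge estimate.
    return a, b, math.hypot(a, b)

# A noise-free ramp f = 2x + 3y + 5 should be recovered exactly.
ramp = [[2 * x + 3 * y + 5 for x in (-1, 0, 1)] for y in (-1, 0, 1)]
a, b, magnitude = plane_fit_gradient(ramp)
print(a, b, magnitude)
```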

As a further application of this approach,
edge detectors can be defined based on least-squares
surface fitting in which the surface is a
step edge superimposed on a low-order polynomial
function. This makes it possible to "filter" optimal
step-based operator responses so as to discriminate
against noise responses, by rejecting
responses for which the fit is poor, without discriminating
against low-contrast edges (which is
unavoidable if thresholding is used for noise
suppression). An example of such edge "filtering"
is shown in Figure 3. For other examples, and
further details, see [9].

3.2 Edge evaluation

A method of evaluating edge detector output has
been developed, based on the local good form of
the detected edges. It combines two desirable
qualities of well-formed edges: good continuation
and thinness. The measure has the expected
behavior for known input edges as a function of
their blur and noise. It yields results generally
similar to those obtained with measures based on
discrepancy of the detected edges from their known
ideal positions, but it has the advantage of not
requiring ideal positions to be known. It can be
used as an aid to threshold selection in edge
detection (pick the threshold that maximizes the
measure), as a basis for comparing the performances
of different detectors, and as a measure of
the effectiveness of various types of preprocessing
operations facilitating edge detection.
This method is described in detail in a separate
paper in these Proceedings, where examples of its
performance are also given [10].

4.  FEATURE  LINKING 

4.1  Edge  segment  linking 

A system of programs that links edge segments
based on both gray level and geometric criteria has
been developed and applied to the detection of
buildings and roads on aerial photographs. Preliminary
results using these programs were described
in [11]; a more detailed description, and
numerous additional results, are presented in [12].
Further work along these lines led to the development
of figures of merit for linking compatible
segments (i.e., segments that could be consecutive
sides of an object) and antiparallel segments
(i.e., segments that could be opposite
sides). For compatible pairs, the figure of merit
is based on the geometrical configuration of the
segments, the similarity of the gray levels on
their "object" sides, and the similarity between
their object sides and the line joining their endpoints.
For antiparallel pairs, it is based on
the homogeneity of gray level between the edges
and the amount of overlap between them. These
figures of merit have highly bimodal histograms,
making it quite easy to decide which pairs of
segments should be linked, as illustrated in
Figures 4 and 5. They should be useful in the
design of relaxation-like schemes for classifying
edge segments. For further details, and many additional
results, see [13,14].

4.2 Reconstruction from gray-weighted medial axes

A method of defining a "min-max medial axis
transformation" (MMMAT) for grayscale images, based
on iterated local MIN and MAX operations, was
described in a previous report [1]. This transformation
associates with each pixel a vector of gray
level increments, and exact reconstruction of the
image is possible from these vectors. Moreover,
good approximations to the image can be reconstructed
using only the strongest components of
the strongest few vectors. A few illustrations
of this were given in an earlier report; further
details and additional examples can be found in
[15].

5.  HIERARCHICAL  REPRESENTATION 

Extensive work has been done on this project
on the use of pyramid and quadtree structures for
image representation and processing. The work
done in this area through March 1980 was summarized
in [16]. In this section we briefly
summarize developments in this area during the
past reporting period.

5.1 Quadtree-to-raster conversion

An algorithm for converting quadtree representations
of binary images to row-by-row (e.g., run-length)
representations was described and partially
analyzed in an earlier report. More recently, a
comparative study and complete analysis of four
such algorithms has been conducted [17]. The
simplest algorithm is a straightforward top-down
approach that visits each run in a row in succession
starting at the root of the tree; the other
algorithms proceed in a manner akin to an inorder
tree traversal. The analysis shows under what
circumstances each algorithm is preferable. They
have all been shown to have execution times proportional
to the sum of the heights of the blocks
comprising the image.
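
The simplest, top-down conversion can be sketched in a few lines of Python. The tree encoding used here (an integer 0 or 1 for a leaf, a four-element list `[NW, NE, SW, SE]` for an internal node) and the function name `row_runs` are assumptions for illustration and are not taken from [17].

```python
def row_runs(node, size, row, x0=0, runs=None):
    """Return the (start, length, value) runs of one image row.

    node: 0 or 1 for a leaf, or [NW, NE, SW, SE] for an internal node.
    size: side length of the square block covered by node.
    row:  row index relative to the top of that block.
    """
    if runs is None:
        runs = []
    if isinstance(node, int):
        # Leaf: extend the previous run if it has the same value and is adjacent.
        if runs and runs[-1][2] == node and runs[-1][0] + runs[-1][1] == x0:
            start, length, value = runs[-1]
            runs[-1] = (start, length + size, value)
        else:
            runs.append((x0, size, node))
        return runs
    half = size // 2
    nw, ne, sw, se = node
    if row < half:
        left, right, sub_row = nw, ne, row
    else:
        left, right, sub_row = sw, se, row - half
    # Visit the two quadrants that intersect this row, left to right.
    row_runs(left, half, sub_row, x0, runs)
    row_runs(right, half, sub_row, x0 + half, runs)
    return runs

# 4x4 image:  1 1 0 0 / 1 1 0 0 / 0 1 1 1 / 1 0 1 1
tree = [1, 0, [0, 1, 1, 0], 1]
for r in range(4):
    print(r, row_runs(tree, 4, r))
```

Each row is produced by a fresh descent from the root, which is what makes this variant simple; the traversal-based algorithms mentioned above avoid repeating that descent.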

5.2  Quadtree-based  image  smoothing 

Two methods for smoothing an image using quadtree
approximations to the image have been developed.
One uses the sizes of the leaves in the
quadtree to determine neighborhood sizes over
which to apply the smoothing. The other method
maps each image gray level i into the gray level
j into which i most frequently maps when we replace
the level of each pixel by the level of the quadtree
leaf to which it belongs. Results obtained
using these methods, as well as a local histogram
peak sharpening method, are shown in Figure 6.
The second quadtree-based method seems to give the
best results. Additional examples, and detailed
descriptions of the methods, can be found in [18].

5.3 Edge pyramids and quadtrees

An edge (or curve) pyramid is a sequence of
successively lower-resolution versions of an image,
each containing a summary of the edge information
in its predecessor. This summary includes the
average edge magnitude and direction in each
"block" of the higher-resolution image, together
with an intercept in that block and a measure
of the error in the direction estimate. An edge
quadtree, analogously, is a variable-resolution
representation of the edge or curve information
in the given image, constructed by recursively
splitting the image into quadrants based on magnitude,
direction, intercept, and error information.
Advantages of these representations include
their registration with the original image, their
ability to represent many edges or curves in a
single tree structure, and their ability to perform
many operations on the represented data efficiently.
A detailed description of these representations,
together with examples, can be found
in a separate paper in these Proceedings [2].

5.4  Pyramid  linking 

When an image is smoothed using small blocks
or neighborhoods, the results may be somewhat unreliable
due to the effects of noise on small samples.
When larger blocks are used, the samples
become more reliable, but they are more likely to
be mixed, since a large block will often not be
contained in a single region of the image. A compromise
approach is to use several block sizes,
representing versions of the image at several resolutions,
and to carry out the smoothing by means
of a cooperative process based on links between
blocks of adjacent sizes. These links define
"block trees" which segment the image into regions,
not necessarily connected, over which smoothing
takes place. The basic "pyramid linking" scheme
was described in an earlier report. Further experiments
with this scheme have led to some improvements
over the original method, based on
better ways of initializing the process and measuring
the link merit. A detailed description of
these experiments and their results can be found
in a separate paper in these Proceedings [3]. It
has also been found that forced-choice linking of
blocks to larger blocks is not necessary; one can
use weighted links, recomputing the weights at
each iteration, and it turns out that the weights
converge to 0's and 1's as the process stabilizes.
Generalizations of this approach to image features
other than gray level, including color signatures
and textural properties, have also been investigated
and will be described in future reports.

Figure 1. Scatterplot enhancement by suppression
of high-gradient pixels.

Key to parts:
    g h i
    d e f
    a b c

a-b) Two bands
c) Scatterplot of (a) vs. (b), log scaled
d-e) Edge responses in the two bands (RMS Roberts operator)
f) Color edge response: RMS of (d) and (e)
g) Enhanced scatterplot, log scaled; pixels
   with edge responses > 2 have been suppressed
h) Mask showing suppressed pixels
i) Histogram of edge responses, log scaled


(a) Suburban scene
(b) Edge segments extracted from (a)
(c) Compatibility merits for pairs of segments in (b)
(d) Pairs of segments having compatibilities > 2

Figure 4. Compatibility merit for edge segments


(a) Antiparallelness merits for the segments in Figure 4b
(b) Pairs of segments having antiparallelness merit > 2

Figure 5. Antiparallelness merit for edge segments


1) Input image
2) Results of histogram sharpening
3) Results of variable-neighborhood smoothing
4) Results of smoothing using most frequent leaf value

Figure 6. Quadtree-based image smoothing


In this series of Image Understanding Workshop
Proceedings, we have stressed the issue of representation. In
particular, we have described the development by Horn and
his collaborators of the reflectance map and the albedo
image, and we have described the work of Marr and his
collaborators using the primal sketch, the 2 1/2-D sketch,
and axis-based 3-D models as part of a comprehensive
theory of recognition.

In  the  April,  1980  Proceedings,  we  reviewed  work 
on  texture  gradients,  zero  crossings,  and  atmospheric 
modelling.  We  introduced  Horn  and  Schunck's  work  on 
the  determination  of  optical  flow  fields  from  smoothly 
varying  brightness  patterns. 

Here  we  review  work  on  using  occluding 
boundaries  to  facilitate  the  computation  of  shape  from 
shading,  the  interpolation  of  smooth  surfaces  from  a 
discrete  set  of  points,  the  detection  and  perception  of 
motion,  vision  hardware,  and  the  geometric  relations  made 
explicit  in  the  full  Primal  Sketch. 

Introduction 

Practical  applications  of  Image  Understanding  require  the 
development  of  algorithms  to  perform  processes  such  as 
extracting  the  important  intensity  changes  from  a  scene, 
or  detecting  movement  in  an  image  and  interpreting  it  as 
motion  in  space.  One  approach  is  to  develop  algorithms 
that  are  tailored  from  the  start  to  a  particular  application 
domain.  An  alternative  is  to  understand  the  basic 
principles  underlying  each  such  module.  One  may  then 
be  in  a  position  to  apply  substantially  the  same  ideas  in 
situations  as  diverse  as  remote  sensing,  object  recognition, 
and object tracking. We have followed the latter course.
One important process is the determination of surfaces
from images. This is the goal of stereo, shape from
shading, shape from texture, and shape from motion.
Progress on this problem, including the interpolation of
smooth surfaces from a discrete set of boundary points, is
a recurrent theme in the current report.


Shape  from  shading  and  occluding  boundaries 

Horn and his collaborators have devoted considerable
attention to computing the shape of a visible surface
from the intensities that comprise its image. The
relationship between them is expressed mathematically by
the Image Irradiance Equation, which is a first order
partial differential equation of the form

$I(x,y) = R(p,q)$,

where p and q are a suitable pair of parameters that
specify the local surface normal. One such pair is the
gradient of the depth function z from the observer with
respect to the image coordinates x and y. The function R
encodes the reflectance characteristics of the surface and
the distribution of light sources, both of which are
assumed to be unknown but fixed (see Horn 1975, Horn
and Sjoberg 1980). Horn (1977) introduced the
Reflectance Map, which associates with each surface
orientation (p,q) of Gradient Space a scaled value for the
image intensity R(p,q). The Reflectance Map has proved
to be a valuable representation in several diverse
applications of shape from shading (Woodham 1978,
Sjoberg and Horn 1980, Silver 1980, Ikeuchi 1981).
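
To make the notion of a reflectance map concrete, the sketch below evaluates R(p,q) for one particular case that is assumed here purely for illustration: a Lambertian surface lit by a single distant point source whose direction is given by the gradient $(p_s, q_s)$. The text does not commit to this reflectance model, and the function name is hypothetical.

```python
import math

def lambertian_reflectance(p, q, ps, qs):
    """Reflectance map R(p, q) for an assumed Lambertian surface.

    (p, q) is the surface gradient and (ps, qs) the gradient of a surface
    patch facing the light source. The value is the cosine of the angle
    between the surface normal and the source direction, clipped at zero
    for self-shadowed orientations.
    """
    numerator = 1.0 + p * ps + q * qs
    denominator = math.sqrt(1.0 + p * p + q * q) * math.sqrt(1.0 + ps * ps + qs * qs)
    return max(0.0, numerator / denominator)

# Brightness is maximal where the surface faces the source.
print(lambertian_reflectance(0.3, -0.2, 0.3, -0.2))
# A surface tilted 45 degrees away from a source at the viewer.
print(lambertian_reflectance(1.0, 0.0, 0.0, 0.0))
```

Contours of constant R in the (p,q) plane are the iso-brightness contours referred to in the discussion of characteristic strips.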

The earliest algorithm for computing shape from
shading was devised by Horn (1975). It exploited the
characteristic strip method of reformulating a single
partial differential equation as a set of five ordinary
differential equations. The idea is to determine the shape
of a surface by computing a set of space curves, called
characteristics, which are everywhere tangential to the
surface. Horn (1975, 197 . ) derived an iterative algorithm
to find characteristic strips. It starts from a point
$(x_0,y_0)$ in the image at which the surface gradient
$(p_0,q_0)$ is known. The step from $(x_n,y_n)$ to
$(x_{n+1},y_{n+1})$ is in the direction of the normal to the
iso-brightness contour passing through $(p_n,q_n)$ in the
Reflectance Map. Similarly, the incremental change in
gradient from $(p_n,q_n)$ to $(p_{n+1},q_{n+1})$ along the
characteristic is in the direction of the normal to the
iso-brightness contour passing through $(x_n,y_n)$ in the
image.
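
The five ordinary differential equations are not written out in this summary. In the usual statement of the characteristic strip method for $I(x,y) = R(p,q)$, with $s$ a parameter along the strip and subscripts denoting partial derivatives (notation assumed here), they take the form:

```latex
\begin{aligned}
\frac{dx}{ds} &= R_p, &
\frac{dy}{ds} &= R_q, &
\frac{dz}{ds} &= p\,R_p + q\,R_q, \\
\frac{dp}{ds} &= I_x, &
\frac{dq}{ds} &= I_y .
\end{aligned}
```

The first pair says that the step in the image follows the gradient of R, which is normal to the iso-brightness contour in the Reflectance Map; the last pair says that the change in $(p,q)$ follows the image brightness gradient, which is normal to the iso-brightness contour in the image.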

One problem with this method concerns the
choice of the singular image point $(x_0,y_0)$ required to
start the iterative process, at which the surface gradient
$(p_0,q_0)$ is determined uniquely by the intensity data. A
further problem is that Horn's algorithm depends on the
assumption that the underlying surface is locally convex
at the singular point. Finally, the class of Image
Irradiance Equations for which Horn's algorithm works
was unknown. (The latter question has recently been
answered by Bruss (1981).) Consequently research was
directed to discover the criteria under which the shape of
a surface is uniquely determined by an image. One
suggestion was that bounding or occluding contours
provide such conditions. Along such contours the surface
normal can be computed exactly from the image.
However, occluding contours pose a problem for the
gradient parameterisation of local surface orientation,
namely at least one of the gradients p or q is infinite.
Ikeuchi and Horn (1981) propose a different
parameterisation of surface orientations that corresponds
to Stereographic Projection. Whereas Gradient Space is
the planar projection of the Gaussian Sphere from its
center, the Stereographic Projection is from the north pole
(it is assumed that the viewpoint is from the south
pole facing the center of the Gaussian Sphere).
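
The practical difference between the two parameterisations can be seen numerically. The sketch below uses one common form of the mapping between gradient coordinates (p,q) and stereographic coordinates, here called (f,g); the scaling (occluding boundary on a circle of radius 2), the symbol names, and the function names are assumptions and may differ from the convention of Ikeuchi and Horn.

```python
import math

def gradient_to_stereographic(p, q):
    """Map gradient-space coordinates (p, q) to stereographic coordinates (f, g)."""
    s = math.sqrt(1.0 + p * p + q * q)
    return 2.0 * p / (1.0 + s), 2.0 * q / (1.0 + s)

def stereographic_to_gradient(f, g):
    """Inverse mapping; undefined on the circle f*f + g*g = 4 (occluding boundary)."""
    d = 4.0 - f * f - g * g
    return 4.0 * f / d, 4.0 * g / d

# A moderately tilted surface patch survives a round trip.
f, g = gradient_to_stereographic(0.5, -2.0)
print(stereographic_to_gradient(f, g))

# Near an occluding contour p grows without bound, but (f, g) stays finite.
f_edge, g_edge = gradient_to_stereographic(1.0e6, 0.0)
print(f_edge, g_edge)
```

All visible orientations, including those on the occluding boundary, therefore fall inside or on a bounded disc, which is what allows boundary normals to be used as known values.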

Ikeuchi and Horn (1981) note some additional
problems with the characteristic strip method for solving
the Image Irradiance Equation. First, since the method
proceeds unidirectionally along the strip, the method
cannot exploit boundary conditions at both ends of the
strip. Second, the build up of numerical errors along any
individual strip can be substantial. The alternative
method devised by Ikeuchi and Horn is to formulate a
smoothness condition in terms of the Stereographic
parameterisation and use it as the basis of a local parallel
computation which "fills in" local surface normals from
the known values on the boundary to the unknown
interior. The resulting algorithm has been tested on a
variety of images and works well. In particular, it
appears to degrade gracefully as errors are introduced to
the placement of the light source, the surface orientation
on the boundary, and the nature of the reflectivity
assumed for the surface. Strong empirical evidence is
provided that the algorithm converges, although no proof
is demonstrated. In case the occluding contour is
partially incomplete, Ikeuchi and Horn's algorithm still
appears to converge, though it is not known at how many
points it is necessary to specify the (Stereographic
parameterisation of the) surface normal.

Bruss (1980) has recently studied some of the
mathematical properties of the Image Irradiance
Equation. First, she has shown that discontinuous
solution surfaces can arise from a continuous Image
Irradiance Equation. It follows that one cannot
determine for a continuous Image Irradiance Equation
whether or not there is an edge. The curvature of a
surface also cannot be determined from its image. As an
example, the Image Irradiance Equation $x^2 + y^2 = p^2 + q^2$ has
two different solution surfaces, one of which consists
entirely of hyperbolic points, while the other consists
entirely of elliptic points. However, Bruss has proved
that there is only one solution that is convex. She has
also shown that bounding contours can be determined
from the image only when the Image Irradiance Equation
is singular. This means that the reflectance function R
and its first order partial derivatives are continuous, while
the intensity function I is singular in x and/or y. For any
given singular Image Irradiance Equation the points on
the occluding contour can be found by inspection of the
intensity function I(x,y).

Bruss has studied singular "eikonal" Image
Irradiance Equations that are of the form $p^2 + q^2 = I(x,y)$.
If the intensity function I(x,y) vanishes to second order at
the singular point, that is to say has the form
$\alpha x^2 + \beta xy + \gamma y^2 + O(|x|^3 + |y|^3)$, then there is exactly one
positive locally convex solution surface in the
neighborhood of the singular point. This result is applied
to show that if there is a closed bounding contour, the
solution surface is unique (up to translation along the z
axis). If either the reflectance function is not $p^2 + q^2$, the
intensity function does not vanish precisely to second
order, or there is not a smooth closed bounding contour,
there is not a unique solution surface.

Progress  on  shape  from  shading  means  that  we 
can  compute  the  topography  of  the  visible  surfaces  from 
an  image  by  a  local  parallel  computation  which  is 
naturally  implemented  in  hardware.  It  exploits  reliable 
information,  and,  as  a  result  of  the  theoretical 
developments  sketched  above,  we  can  reasonably  predict 
the  behavior  of  the  algorithm  in  unpredictable  situations. 

The  detection  and  perception  of  motion 

In the last workshop Proceedings we sketched an
algorithm to determine optical flow. Optical flow is the
distribution of velocities of apparent movement caused by
smoothly changing brightness patterns. It has been noted
that optical flows encode rich information about a scene
and observer motion, and it has been suggested that this
information can be computed from the flow field. In
particular, it has been proposed that optical flow
facilitates object segmentation, computation of the
parameters of the observer's own motion relative to the
scene, and the determination of visible local surface
normals.

For the most part, research has been concerned
with interpretation. It is generally supposed that the flow
is given, that it is somehow computed automatically and
sufficiently noise-free by "velocity sensitive neurons" in
animate visual systems. Horn and Schunck (1981) have
studied the generation of the optical flow from smoothly
time varying brightness patterns. They restrict attention
to imaging a flat surface, uniform incident illumination,
and smoothly varying reflectance. With these
assumptions, the image brightness E at point (x,y) at time t
is governed by

$E_x u + E_y v = -E_t$,

where (u,v) is the optical flow and subscripts denote partial
derivatives. This linear constraint specifies the component
of the flow in the direction of the brightness gradient
(that is, normal to the iso-brightness contours). In order
to compute the component
of the flow along iso-brightness contours a further
assumption is required. Horn and Schunck propose a
measure of the smoothness of flow. The departure from
smoothness and the error due to changing brightness are
combined and used to define an iterative algorithm for
computing the flow from a sequence of images.
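
The per-pixel step of such an iteration can be sketched as follows. The update rule is the commonly quoted form of the Horn and Schunck scheme, in which a weight alpha balances smoothness against the brightness constraint; the function name, argument names, and the numerical example are assumptions, and the computation of the local flow averages and derivative estimates over a real image pair is left out.

```python
def hs_update(u_bar, v_bar, ex, ey, et, alpha):
    """One flow update at a single pixel.

    u_bar, v_bar: local averages of the current flow estimate.
    ex, ey, et:   estimated partial derivatives of image brightness.
    alpha:        smoothness weight (assumed parameter; larger means smoother).
    Requires alpha**2 + ex**2 + ey**2 > 0.
    """
    # Residual of the brightness constraint at the averaged flow,
    # scaled by the regularised squared gradient magnitude.
    t = (ex * u_bar + ey * v_bar + et) / (alpha ** 2 + ex ** 2 + ey ** 2)
    # Move the averaged flow along the brightness gradient to reduce the residual.
    return u_bar - ex * t, v_bar - ey * t

# With no smoothness weight, the result satisfies the constraint exactly.
u, v = hs_update(0.0, 0.0, 3.0, 4.0, -10.0, 0.0)
print(u, v, 3.0 * u + 4.0 * v - 10.0)

# With a very large weight, the result stays at the neighbourhood average.
print(hs_update(1.0, 2.0, 3.0, 4.0, -10.0, 1.0e6))
```

Only the component along the brightness gradient is changed by the data term; the component along the iso-brightness contour is inherited from the neighbourhood average, which is how the smoothness assumption supplies the missing information.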

The  algorithm  works  well  on  synthetic  images, 
especially when there are no depth boundaries. It also
seems to give good results when there are depth
boundaries,  though  the  errors  in  the  flow  become 
significant  on  the  boundary.  Schunck  is  continuing  to 
develop  the  algorithm  to  make  it  more  generally  useful. 
It  is  already  clear  that  it  is  difficult  to  achieve  the  ideal 
noise-free  flow  fields  assumed  as  input  by  published 
interpretation  schemes.  Consequently,  we  shall  reconsider 
the  interpretation  of  flow  fields  generated  as  the  output 
of  Horn  and  Schunck's  algorithm. 

Generally,  progress  in  the  determination  and 
perception  of  motion  rests  on  the  isolation  of  useful 
representations,  since  motion  refers  to  changes  to  those 
representations. For example, Ullman's (1978) work on
the correspondence computation followed Marr's (1976)
discussion of the Primal Sketch. Similarly, the
determination of optical flow sketched above rests upon
Horn's work on shape from shading. Progress has been
made on understanding motion, based on our work with
several  other  representations.  The  following  paragraphs 
summarise  our  efforts. 

Previous  workshop  Proceedings  have  reported 
our work on zero crossings. Marr and Ullman (1980)
suggested that the time rate of change of
$S(x,y,t) = \nabla^2 G(x,y) * I(x,y,t)$ can enable one to detect the
direction of motion of zero-crossings. The practical
importance of this approach is that in attempting to track
the motion of objects, it seems reasonable to find the
important intensity changes and find out what they are
doing. Let T(x,y,t) denote the partial derivative of
S(x,y,t) with respect to time. Marr and Ullman consider
the response of S(x,y,t) and T(x,y,t) in the vicinity of a
moving isolated intensity edge, thin bar, and wide bar.
They show, for example, that for an edge moving to the right,
T(x,y,t) is positive at the zero crossing, while for motion
to the left it is negative. Marr and Ullman propose that
motion to the right can be detected by the simultaneous
activity of S+, T+, and S-. Here S+ refers to the positive
component  of  S. 
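
The sign rule for T at a zero crossing can be checked with a small one-dimensional sketch. It assumes a unit step edge that is dark on the left and bright on the right, smoothed by a Gaussian of width `SIGMA`, so that S reduces to the first derivative of the Gaussian centred on the edge. The names `S`, `T`, and `direction_at_zero_crossing`, the value of `SIGMA`, and the finite-difference step are illustrative choices. With the opposite contrast polarity both signs would flip.

```python
import math

SIGMA = 2.0  # width of the smoothing Gaussian (illustrative value)


def gaussian(x):
    return math.exp(-x * x / (2 * SIGMA ** 2)) / (SIGMA * math.sqrt(2 * math.pi))


def S(x, t, velocity):
    """Second spatial derivative of a Gaussian-smoothed unit step edge.

    The edge is dark on the left, bright on the right, and located at
    x = velocity * t. Smoothing a step with G gives the Gaussian CDF, whose
    second derivative is G'(x) = -(x / sigma^2) G(x).
    """
    d = x - velocity * t
    return -(d / SIGMA ** 2) * gaussian(d)


def T(x, t, velocity, dt=1e-3):
    """Central-difference estimate of the time derivative of S."""
    return (S(x, t + dt, velocity) - S(x, t - dt, velocity)) / (2 * dt)


def direction_at_zero_crossing(velocity):
    # At t = 0 the zero crossing of S sits at x = 0 for any velocity.
    return "right" if T(0.0, 0.0, velocity) > 0 else "left"
```

For this edge polarity, `direction_at_zero_crossing(1.0)` reports rightward motion and `direction_at_zero_crossing(-1.0)` reports leftward motion, in line with the sign of T described above.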

Work  on  directional  selectivity  has  suggested  a 
possible VLSI implementation of the $\nabla^2 G(x,y)$ operator
based on an analysis of the animate retina. Marr and
Ullman find close agreement at moderate speeds between
their theoretical predictions of the response of ganglion X
and Y cells to simple moving stimuli and actual cell
recordings from the physiology literature (see Richter
and Ullman 1980, figures 13 and 15). Richter and
Ullman have recently accounted for the discrepancy at
high speeds, and generally refined the model of
directional selectivity by noting that the Gaussians,
whose difference approximates $\nabla^2 G(x,y)$, act like RC
filters  with  different  time  constants.  This  causes  a  slight 
delay  in  the  onset  of  the  negative  outer  part  relative  to 
the  positive  central  part.  Richter  and  Ullman's 
predictions  show  remarkable  agreement  with  cell 
recordings  for  a  wide  variety  of  stimuli. 

Motion  at  the  level  of  the  Primal  Sketch  and  up 
requires  careful  attention  to  coordinate  frames.  Brady 
and  Prazdny(1980)  showed  how  apparent  or  induced 
motion  can  be  explained  on  the  basis  of  eye  tracking  and 
local  coordinate  frames.  Local  coordinate  frames  were  a 
major  concern  of  Marr  and  Nishihara(1977)  in  their 
contributions  to  object  representations  based  on 
generalized  cones. 

Interpolation of curves and surfaces

Many of the visual processes discussed here and in
previous Proceedings compute the shape of a visible
surface  by  finding  the  local  surface  orientation 
everywhere  within  its  boundaries.  This  includes  the  work 
of  Horn  and  his  colleagues  on  shape  from  shading,  and 
the  computation  of  shape  from  texture  investigated  by 
Stevens (1980) and Witkin (1980). On the other hand,
binocular  stereo  computes  disparity  at  the  discrete  set  of 
zero  crossings.  A  change  of  coordinates  can  convert 
angular  disparities  to  depths,  but  to  compute  the  local 
surface normal everywhere on a visible surface it is first
necessary to interpolate a smooth surface from the
discrete  set  of  given  points.  Binocular  stereo  is  not  the 
only  module  which  generates  an  incomplete  surface 
orientation  map.  Stevens(1981a)  considers  the 
interpretation  of  surface  contours,  and  finds  that  they 
strongly  constrain  the  perception  of  the  underlying 
surface. Horn (1982) and Marr (1978) suggest that in
addition to local surface orientation, it is advantageous to
make explicit discontinuities in surface orientation and
depth.  It  is  not  yet  clear  how  surface  normals  should  be 
parameterised,  nor  how  accurately  their  values  should  be 
represented.  Moreover,  substantial  advantages  are  likely 
to  accrue  from  attaching  texture  and  color  descriptors  to 
visible  surfaces,  but  the  details  are  as  yet  unclear. 

We  have  studied  the  interpolation  of  a  smooth 
surface  from  a  discrete  set  of  points.  One  possibility  is  to 
use  Coons  patches,  and  Bezier  and  Ferguson  surfaces 
developed for work in Computer Aided Design (CAD)
and Computer Aided Manufacture (CAM) (see Faux and
Pratt 1979). A practical difficulty with this approach
stems from the spatial irregularity of the discrete set of
boundary values (for example, disparity values at matched
zero crossings). Consequently we have investigated
surface  interpolation  using  what  we  know  about  human 
vision,  by  isolating  constraints  which  have  not  figured 
largely  in  the  development  of  CAD/CAM.  Essentially, 
two  such  constraints  have  been  uncovered,  and  are 
currently  receiving  attention. 

The  first  was  introduced  by  Grimson(1981). 
Suppose  that  a  smooth  surface  S  is  interpolated  from  a 
given  set  of  boundary  values.  Grimson  observes  that  the 
absence  of  a  boundary  value  at  (x,y)  means  that  the 
gradient of S cannot change too rapidly there. Grimson
has  coined  a  suggestive  slogan  for  this  analysis:  no  news 
is  good  news.  To  make  this  observation  precise,  Grimson 
notes  that  Horn’s  work  on  image  formation  enables 
conditions  to  be  derived  on  the  zero  crossings  that  would 
arise  from  the  surface  S. 

The second constraint is based on the idea that,
in the absence of contrary evidence, the human visual
system constructs the most conservative curve or surface
consistent with the data. We are able to interpolate
smooth curves and surfaces without involving rich
semantics. It also seems that the shape of the boundary
plays the most significant role in determining the
interpolated surface (see for example figure 2-3 in Barrow
and Tenenbaum 1981). Taken together, these ideas
suggest that the interpolation process can be modelled in
terms of modern control theory (see for example, Schultz
and Melsa 1967). The idea is to isolate an appropriate
"performance index" P and define the interpolated
surface  to  be  the  one  that  minimises  the  integral  of  P 
subject  to  the  boundary  constraints.  Clearly,  one 
requirement  on  the  performance  index  is  that  a  minimal 
surface  should  be  guaranteed  to  exist  for  any  given  set  of 
boundary  values.  To  this  end,  Grimson  has  proved  the 
following  theorem:  Suppose  that  there  exists  a  complete 
seminorm  d  on  a  space  H  of  functions,  and  that  d 
satisfies  the  parallelogram  law.  Then  every  non-empty 
closed  convex  subset  of  H  contains  a  unique  function  of 
minimum  seminorm. 
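
For reference, the two ingredients of the theorem can be written out as below. The notation is an assumption made for this note: K stands for the closed convex set of surfaces that agree with the given boundary values, f* for the interpolated surface, and the seminorm d is taken to be the one induced by the integral of the performance index P.

```latex
d(f+g)^2 + d(f-g)^2 = 2\,d(f)^2 + 2\,d(g)^2 \quad \text{for all } f, g \in H,
\qquad
f^{*} = \arg\min_{f \in K} d(f), \quad d(f)^2 = \iint P \, dx \, dy .
```

The first relation is the parallelogram law required of d. The second restates the interpolation problem as the search for the element of minimum seminorm in K.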

A  number  of  plausible  performance  indices  can 
be  evaluated  in  the  light  of  this  theorem.  Grimson  notes 
that  the  mean  and  Gaussian  curvature  do  not  satisfy  the 
required  conditions.  He  suggests  using  instead  the 
quadratic  variation,  which  is  defined  to  be 
$f_{xx}^2 + 2 f_{xy}^2 + f_{yy}^2$, and derives a 5 by 5 interpolation
operator to compute the minimal surface. The square
Laplacian also satisfies the conditions of the theorem, but
is rejected by Grimson on the grounds that its null space
is larger (see Grimson 1981, section 3.1.4). Recently,
Brady and Horn (forthcoming) have noted that any
quadratic form in $f_{xx}$, $f_{xy}$, and $f_{yy}$ satisfies the theorem.
They have shown that the quadratic forms which are
rotationally invariant form a vector space which has the
square Laplacian and the quadratic variation as a basis.
Since the quadratic variation has the smaller null space, it
is to be preferred.
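
The difference in null spaces can be illustrated numerically. The sketch below evaluates both performance indices with central finite differences on a unit grid. The function names and the two test surfaces (a plane and the saddle f = xy) are chosen for illustration and do not come from the work cited above.

```python
def second_differences(f, x, y):
    """Central finite-difference estimates of f_xx, f_xy, f_yy on a unit grid."""
    fxx = f(x + 1, y) - 2 * f(x, y) + f(x - 1, y)
    fyy = f(x, y + 1) - 2 * f(x, y) + f(x, y - 1)
    fxy = (f(x + 1, y + 1) - f(x + 1, y - 1)
           - f(x - 1, y + 1) + f(x - 1, y - 1)) / 4
    return fxx, fxy, fyy


def quadratic_variation(f, x, y):
    fxx, fxy, fyy = second_differences(f, x, y)
    return fxx ** 2 + 2 * fxy ** 2 + fyy ** 2


def square_laplacian(f, x, y):
    fxx, _, fyy = second_differences(f, x, y)
    return (fxx + fyy) ** 2


def plane(x, y):
    return 3 * x - 2 * y + 1


def saddle(x, y):
    return x * y
```

Both indices vanish on the plane. The saddle, however, has zero square Laplacian but quadratic variation 2, so a minimiser of the square Laplacian cannot distinguish it from a flat surface, while the quadratic variation penalises it.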

Brady, Grimson, and Langridge (1980) use an
approximation to the one dimensional quadratic variation
$f_{xx}^2$ to argue that subjective contours are cubics. The
exact minimal integral curvature curve has recently been
found by Horn (1981). Brady and Grimson (forthcoming)
use  these  ideas  about  surface  interpolation  to  propose 
that  subjective  contours  arise  from  surface  perception. 

Real  time  convolution 

At  the  last  workshop  we  took  delivery  of  a  number  of 
the Hughes Research Laboratories VLSI multi-function
convolution chips prepared for the DARPA Image
Understanding Program (see Nudd et al. 1980). We
began to construct the hardware necessary to test and
evaluate them. Specifically, a "serpentine memory" was
required to buffer video output from a TV camera so
that 26 successive image lines could be fed to the
convolver in parallel at video rates. The VLSI circuit is
designed to convolve the image with a built-in 26 by 26
difference of Gaussian mask, producing a video signal
that lags the TV input by 26 lines. In the process of
doing this, we found that we could design a convolver
using digital TTL technology that could process one
million pixels per second. Although this is slower than
the  speed  claimed  for  the  Hughes  chip,  the  TTL 
approach  has  a  number  of  advantages  for  us.  The 
extensive  analog  circuitry  necessary  to  operate  the 
Hughes  chip  can  be  avoided.  The  all  digital  design 
provides  more  precision  than  is  possible  with  the  analog 
storage  of  weights  and  intensities,  and  digital  logic  is 
easier  to  debug  and  modify.  Finally,  variable  sized 
Gaussian  filters  could  be  provided. 

The  design  of  the  TTL  convolver  embodied  two 
principal  guidelines.  First,  the  convolver  was  required  to 
serve as the first stage of a real time implementation of
the Marr-Poggio-Grimson theory of stereo vision (Marr
and Poggio 1979, Grimson 1980, 1981). Eventually we
hope  to  be  able  to  carry  out  the  matching  process 
required  by  the  theory  without  using  large  "frame  buffer" 
memories  for  storing  the  intermediate  results  of  matching. 
The  freedom  from  having  to  use  a  frame  buffer  allows  us 
to  handle  images  with  an  arbitrary  number  of  lines.  The 
present  system  handles  1024  pixels  per  line  and  places  no 
restriction  on  the  number  of  lines  in  the  image.  Second, 
the  entire  system  was  designed  as  a  set  of  modules  that 
can  be  separately  developed.  Each  module  takes  an  array 
from  the  Lisp  machine  memory  as  input,  and  writes  its 
results into another array. A consequence of this design
decision is that we have been able to write software
simulations  of  incomplete  parts  of  the  overall  system  on 
the  Lisp  machine  and  test  the  system  as  a  whole 
throughout  the  development  process. 

The  current  system  takes  its  input  from  a 
one-dimensional solid state CCD array camera in
conjunction  with  a  mirror  deflection  system.  The  current 
camera  is  a  1024-element  linear  array  scanner.  A 
shortcoming  of  such  devices  is  their  limited  light 
sensitivity  resulting  from  the  brief  integration  time  as  the 
mirror sweeps over the image. The present system works
well with studio lighting.


The  convolution  module  was  the  central  focus  of 
the first half of our development effort because of the
large computational demands it makes. For digital
Gaussian convolution with a 32 by 32 mask size, a
minimum of 32 multiplies are required for each pixel. To
maintain adequate precision, the first one-dimensional
convolution requires 16 8x8 multiplies while the second
requires  16  8x16  multiplies.  Using  TRW  multiplier  chips 
we  are  able  to  process  just  under  one  million  pixels  per 
second.  Higher  pixel  rates  should  be  obtainable  with 
more  parallel  or  analog  designs  such  as  the  Hughes 
convolver  chip. 
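
One reading of the multiply counts above is that the 32 by 32 Gaussian is applied as two one-dimensional passes, and that each 32-tap pass exploits the symmetry of the mask by adding the two samples that share a weight before multiplying. The sketch below illustrates that reading only. It is not a description of the actual hardware, and the function names, integer weights, and test row are hypothetical.

```python
def fold_convolve(samples, half_weights):
    """Convolve with a symmetric even-length mask using folded multiplies.

    half_weights holds the first half of the mask; the full mask is
    half_weights followed by its mirror image. Returns the valid outputs
    and the number of multiplies spent on each output sample.
    """
    n = len(half_weights)
    taps = 2 * n
    outputs = []
    for start in range(len(samples) - taps + 1):
        window = samples[start:start + taps]
        acc = 0
        for k in range(n):
            # Add the two samples that share a weight, then multiply once.
            acc += half_weights[k] * (window[k] + window[taps - 1 - k])
        outputs.append(acc)
    return outputs, n


def direct_convolve(samples, mask):
    """Reference implementation: one multiply per mask tap."""
    taps = len(mask)
    return [sum(mask[k] * samples[start + k] for k in range(taps))
            for start in range(len(samples) - taps + 1)]


half = list(range(1, 17))                    # 16 illustrative integer weights
mask = half + half[::-1]                     # symmetric 32-tap mask
line = [(7 * i) % 256 for i in range(64)]    # one row of 8-bit pixel values

folded, multiplies_per_pass = fold_convolve(line, half)
multiplies_per_pixel = 2 * multiplies_per_pass   # row pass plus column pass
```

Under these assumptions the folded pass reproduces the direct 32-tap result with 16 multiplies per output, giving 32 multiplies per pixel over the two passes.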

A  central  issue  was  whether  to  compute  the 
Laplacian of a single Gaussian convolution, or
approximate it as the difference of two Gaussians (Marr
and  Hildreth  1980.)  Generally,  for  Gaussian  convolutions 
with  a  given  precision,  the  difference  of  Gaussian 
approach  offers  a  better  signal  to  noise  ratio  because  of 
the  second-order  differences  that  arise  from  computing 
the  Laplacian.  Designing  two  copies  of  the  same 
convolution  circuit  is  also  easier,  so  the  difference  of 
Gaussian  approach  was  selected. 
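
A difference of Gaussian mask of the kind selected here can be sketched in one dimension as follows. The ratio of 1.6 between the two Gaussian widths is the value recommended by Marr and Hildreth (1980) and is an assumption as far as this hardware is concerned, as are the function names and the choice of sigma. The narrow-minus-wide ordering gives a positive centre, which matches the "central positive diameter" terminology used in this report.

```python
import math


def gaussian_mask(sigma, size):
    """Sampled 1-D Gaussian of the given length, normalised to unit sum."""
    centre = (size - 1) / 2
    raw = [math.exp(-((i - centre) ** 2) / (2 * sigma ** 2)) for i in range(size)]
    total = sum(raw)
    return [r / total for r in raw]


def dog_mask(sigma, size=32, ratio=1.6):
    """Narrow Gaussian minus wide Gaussian; ratio is the assumed sigma ratio."""
    narrow = gaussian_mask(sigma, size)
    wide = gaussian_mask(sigma * ratio, size)
    return [a - b for a, b in zip(narrow, wide)]


def positive_sample_count(mask):
    """Number of positive samples, which for this mask form one central run."""
    return sum(1 for m in mask if m > 0)


mask = dog_mask(sigma=3.0)
```

Because both Gaussians are normalised to unit sum, the mask sums to zero and so gives no response to uniform brightness. Only two copies of the same Gaussian circuit and a subtraction are needed to produce it.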

The  resulting  hardware  takes  8  bit  pixels  as 
input and produces signed 16 bit numbers as output.
The  32x32  Gaussian  mask  size  allows  difference  of 
Gaussian  convolution  masks  with  central  positive 
diameters  of  between  2  and  12  pixels.  Using  array  to 
array mode on a Lisp machine, a 1000x1000 image of 8
bit pixels can be convolved and the result stored as a
1000x1000  array  of  16  bit  numbers  in  about  20  seconds. 
Of  this  time,  only  1.5  seconds  is  needed  to  convolve  the 
image,  the  remainder  being  used  for  paging  between  disk 
storage  and  memory. 

The  Full  Primal  Sketch 

There  were  two  distinct  aspects  to  Marr’s(1976)  original 
discussion  of  the  Primal  Sketch  representation.  First,  the 
important  intensity  changes  in  the  image  are  made 
explicit  and  described  as  blobs,  terminations,  and  various 
kinds of scene event such as shading edges. This
description was called the Raw Primal Sketch. Second,
the  Full  Primal  Sketch  is  a  hierarchical  representation 
that  results  from  making  explicit  the  local  geometrical 
structure  of  the  "place  tokens"  comprising  the  Raw 
Primal  Sketch  and  lower  levels  of  the  hierarchy. 
Marr(1976)  suggested  a  number  of  processes  which  can 
plausibly  discover  the  local  organisation  perceived  by 
humans. These include collinearity, parallelism, the
formation  of  "clusters",  and  "theta  aggregation". 


For  example,  suppose  that  a  striped  surface 
texture such as ruled writing paper is occluded by some
nearer  surface  such  as  a  book.  In  the  Raw  Primal 
Sketch,  the  line  terminations  corresponding  to  the  rulings 
of the paper intersecting the bounding contour of the
book  would  be  aligned.  That  geometric  arrangement  has 
to  be  discovered  as  a  curvilinear  aggregate,  and  a 
description  of  it  written  down  in  the  Full  Primal  Sketch. 
In  this  way,  the  Full  Primal  Sketch  aims  to  make  explicit 
the  geometric  properties  that  facilitate  the  interpretation 
of  the  Raw  Primal  Sketch  in  terms  of  physical  surfaces. 
Riley’s(1981)  forthcoming  thesis  examines  the  theoretical 
basis  for  the  Full  Primal  Sketch,  and  shows  that 
discontinuities  in  certain  attributes  of  the  Raw  Primal 
Sketch  description  are  reliably  correlated  with  edges  of 
physical surfaces while others may either be due to
surface creases or physical edges. Riley's work sheds
some  light  on  the  confusing  psychophysical  phenomena  of 
texture  discrimination. 

Stevens (1981b) has worked in a complementary
area, studying how collinearity among points is detected.
Marr(1976)  conjectured  that  grouping  (such  as 
curvilinear  aggregation)  should  be  performed  on  symbolic 
place  tokens  which  mark  distinguished  points  in  the 
image.  Stevens  has  developed  theoretical  and 
psychophysical  evidence  in  support  of  this  conjecture  for 
human  vision.  The  core  of  Marr’s  place  token  hypothesis 
is  that  grouping  and  aggregation  processes  operate  not  on 
the  image,  but  on  tokens  extracted  from  the  Raw  Primal 
Sketch.  There  have  been  proposals  that  edge  detection 
mechanisms  might  trigger  weakly  on  linear  groupings,  as 
blurring  a  line  of  dots  into  a  fuzzy  line  might  suggest. 
However, for edge detection to extract the collinearity
that way, there should be a reliable correlation between
the physical processes that cause collinearity and the
physical  processes  that  cause  relatively  low  spatial 
frequency  intensity  changes  in  the  image.  That 
correlation  is  weak  and  unreliable,  as  can  be 
demonstrated  by  means  of  the  Marr-Hildreth  model  of 
edge  detection  (Marr  &  Hildreth  1980). 

Altogether,  our  increased  understanding  of  the 
Full Primal Sketch should enable us to extract more
reliable texture descriptors from images.

Our principal objective in this research
program is to obtain solutions to fundamental
problems in computer vision that have broad
military relevance, particularly in the areas of
cartography and photo interpretation. In addition
to our own research, we are designing and
implementing an integrated testbed system that
incorporates the results of research produced
throughout the image understanding community. This
system will provide a coherent demonstration and
evaluation of the accomplishments of ARPA's Image
Understanding Program, thereby facilitating
transfer of this technology to the appropriate
military organisations.


I  INTRODUCTION 


Research at SRI International under the ARPA
Image Understanding Program was initiated to
investigate ways in which diverse sources of
knowledge might be brought to bear on the problem
of analyzing and interpreting aerial images. The
initial, exploratory phase of research identified
various means for exploiting knowledge in the
processing of aerial photographs for such military
applications as cartography, intelligence, weapon
guidance, and targeting. A key concept is the use
of a generalized digital map to guide the process
of image analysis. The results of this earlier
work were integrated into an interactive computer
system called "Hawkeye" [1]. This system provides
necessary basic facilities for a wide range of
tasks in cartography and photo interpretation.

Research subsequently focused on development
of a program capable of expert performance in a
specific task domain: road monitoring. The primary
objective of this continuing research has been to
build a computer system, called the Road Expert,
that "understands" the nature of roads and road
events. It is now capable of performing such tasks
as


* Distinguishing vehicles on roads from
shadows, signposts, road markings, etc.

* Comparing multiple images and symbolic
information pertaining to the same road
segment, and deciding whether significant
changes have occurred.

The general approach and technical details of
the Road Expert's components are contained in
References [2-7]. We have integrated these
separate components into a coherent system that
facilitates testing and evaluation, and have
transferred this system to the ARPA/DMA Testbed.

In the most recent phase of our work, we have
initiated major efforts in two new directions. The
first is in support of a joint ARPA/DMA program to
provide a framework for demonstrating and
evaluating the applicability of image understanding
research (from throughout the entire IU community)
to military problems in general, and to the
problems of automated cartography in particular.
Our plans and progress in this effort (ARPA/DMA
Testbed) are described in an appendix to this
paper.

The goal of our second major effort is to
broaden the scope and generality of our image
understanding research, specifically in the areas
of three-dimensional terrain understanding,
perceptual reasoning, linear-feature analysis, and
image description and matching. A parallel
research program (described in Reference [7]),
jointly supported by ARPA and NSF, complements
these investigations by focusing on fundamental
computational principles underlying the early
stages of visual processing in both man and
machine.


II  RESEARCH  PROGRESS  AND  ACCOMPLISHMENTS 


Our  current  research  efforts  are  focused  on 
three  somewhat  independent  problem  domains: 


* Detection, delineation, and interpretation
of linear features in aerial imagery.

* Image matching and image-to-database
correspondence techniques (including
landmark selection and detection techniques
for application to autonomous vehicle
navigation).

* 3-D computation and interpretation (stereo
matching, modeling, and raised-object
cueing).

* The research described in this paper is based on work performed under Advanced Research Projects Agency
Contract No. MDA903-79-C-0588.


A.  Research  in  Linear-Feature  Analysis 

The  task  of  analyzing  and  interpreting  linear 
features  in  aerial  photography  has  a  natural 
partition  based  on  the  resolution  of  the  imagery. 

In  working  with  low-resolution  images,  where 
the  linear  structures  are  "line-like"  in 
appearance,  we  have  been  primarily  concerned  with 
techniques for automatic delineation. A summary of
our past work in this area is contained in
Reference [6]. In that paper we described road
domain applications of the techniques we had
developed for combining and linking multisource
information. We are now investigating the
application  of  these  techniques  to  other  types  of 
low-resolution  linear  structures.  The  key  step 
here  is  finding  suitable  models  to  describe  the 
"local"  appearance  of  these  linear  structures. 
Figure  1  shows  some  preliminary  results  in  the 
detection  and  delineation  of  rivers. 

In high-resolution imagery, where the internal
details of the linear structures are visible, the
models and tracking techniques are necessarily
somewhat domain-specific. A very successful
approach to road tracking is described in Reference
[3]. Our current work in this area addresses the
problem of identifying and describing generic
classes of objects that appear within the road
boundaries (cars, shadows, road markings, etc.).

Our approach is first to dynamically model the
nominal road surface intensity pattern, next to use
this model to reduce the background to a relatively
homogeneous field in which anomalous objects (i.e.,
those that occlude the nominal road surface) are
easily detected, and, finally, to inspect and
classify the detected objects. Figure 2 shows an
example of the steps in this process; a summary of
the work is contained in Reference [9].


B. Research in Image Matching

Image matching is one of the broadest and most
basic operations in scene analysis. Our current
work in this area is concerned primarily with
putting an image into correspondence with an
existing data base. This is a requirement for
almost all cartographic and "knowledge-based" image
interpretation applications and is the essential
step in scene-based autonomous navigation. A
summary of our relevant contributions is contained
in References [4] and [8].

In Reference [8] we introduced a new paradigm
(RANSAC) for fitting a model to experimental data.
RANSAC is capable of interpreting/smoothing data
that contain a significant percentage of gross
errors. It is thus ideally suited for applications
in automated image analysis in which interpretation
is based on the data provided by error-prone
feature detectors. A major portion of this paper
[8] describes the application of RANSAC to the
Location Determination Problem (LDP): given an
image depicting a set of landmarks with known
locations, determine that point in space from which
the image was obtained. New results are derived
for the minimum number of landmarks needed to
obtain a solution, and algorithms are given for
computing these minimum-landmark solutions in
computing  these  minimum-landmark  solutions  in 
closed  form.  These  results  form  the  basis  for  an 
automatic  system  that  can  solve  the  LDP  even  under 
severe  viewing  and  analysis  conditions. 
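
The random-sample-consensus idea can be sketched on a much simpler model than the LDP. The example below fits a straight line to points of which roughly a third are gross errors. It is an illustration under stated assumptions only: the function names, the tolerance, the number of trials, the fixed random seed, and the synthetic data are all hypothetical, and a line stands in for the camera-location model treated in Reference [8].

```python
import random


def fit_line(p, q):
    """Slope and intercept of the line through two points (None if vertical)."""
    (x1, y1), (x2, y2) = p, q
    if x1 == x2:
        return None
    slope = (y2 - y1) / (x2 - x1)
    return slope, y1 - slope * x1


def least_squares(points):
    """Ordinary least-squares line through a list of (x, y) points."""
    n = len(points)
    mx = sum(x for x, _ in points) / n
    my = sum(y for _, y in points) / n
    sxx = sum((x - mx) ** 2 for x, _ in points)
    sxy = sum((x - mx) * (y - my) for x, y in points)
    slope = sxy / sxx
    return slope, my - slope * mx


def ransac_line(points, tolerance=0.5, trials=200, seed=0):
    """Fit a line to data that may contain gross errors."""
    rng = random.Random(seed)
    best = []
    for _ in range(trials):
        # Hypothesise a model from the minimum number of points (two).
        model = fit_line(*rng.sample(points, 2))
        if model is None:
            continue
        slope, intercept = model
        consensus = [(x, y) for x, y in points
                     if abs(y - (slope * x + intercept)) <= tolerance]
        if len(consensus) > len(best):
            best = consensus
    # Smooth only the points that agree with the best minimal-sample model.
    return least_squares(best), best


good = [(x, 2 * x + 1) for x in range(20)]
gross = [(1, 40), (3, -25), (6, 60), (8, -30), (11, 70), (13, -15), (16, 5), (18, 90)]
(slope, intercept), consensus = ransac_line(good + gross)
```

The design choice worth noting is that each hypothesis is built from as few points as possible and then enlarged by consensus, rather than fitting all the data at once and discarding outliers afterwards. Here the consensus set excludes the eight gross errors, and the final fit recovers the line y = 2x + 1.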

We have now begun an investigation of the
problem of how to select and identify landmarks
automatically, i.e., the design of a "Landmark
Expert." This effort will benefit from our work on
3-D compilation and raised-object cueing.


C. 3-Dimensional Compilation and Interpretation

We  have  initiated  a  number  of  new  efforts  in 
stereo  matching  and  stereo  modeling.  With  respect 
to  the  former,  we  are  investigating  the  feasibility 
of  an  adaptive  matching  technique  whose  parameters 
are  determined  by  the  local  geometric  and 
photometric scene characteristics. As regards
stereo  modeling,  we  are  developing  methods  for 
analyzing  areas  that  cannot  be  matched,  as  well  as 
for  identifying  occlusion  edges  and  small  raised 
objects.  As  part  of  this  new  work,  and  also  as  an 
adjunct  to  the  testbed  effort,  we  have  written  a 
paper  [10]  that  presents  an  overview  and  evaluation 
of  the  state  of  the  art  in  stereo  compilation. 

Stereo  Matching 

The  goal  of  a  stereo  compilation  system  is  to 
generate  a  three-dimensional  model  of  a  scene, 
given  two  or  more  images  that  have  been  taken  from 
different  perspectives.  There  are  at  present  two 
techniques  for  automatic  stereo  matching: 
correlation  area  matching  and  low-level  feature 
matching.  Each  of  these  has  both  appropriate  and 
inappropriate  scene  domains.  Interestingly,  many 
of  the  domains  where  correlation  performs  well  are 
those  in  which  feature  matching  performs  poorly, 
and vice versa [10]. Furthermore, none of the
current  matching  techniques  makes  use  of  the 
physical  constraints  that  result  from  knowledge  of 
the  illumination  source,  the  photometric  properties 
of  surfaces,  and  the  geometric  properties  of 
natural  and  cultural  objects.  We  are  investigating 
new  techniques  that  can  smoothly  combine  several 
existing  matching  techniques  to  exploit  available 
physical constraints and thus attain levels of
matching  performance  impossible  with  any  single 
technique. 
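
Correlation area matching, the first of the two techniques named above, can be sketched along a single scanline as follows. The sketch assumes rectified images, so that corresponding points lie on the same row, and uses normalised cross-correlation over a fixed window. The function names, window size, search range, and synthetic scanlines are illustrative and are not taken from any system described here.

```python
import math


def ncc(a, b):
    """Normalised cross-correlation of two equal-length windows."""
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    num = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    den = math.sqrt(sum((x - ma) ** 2 for x in a) * sum((y - mb) ** 2 for y in b))
    return num / den if den else 0.0


def best_disparity(left, right, x, half_width, max_disparity):
    """Disparity d maximising the correlation of left[x] with right[x - d]."""
    target = left[x - half_width:x + half_width + 1]
    scores = {}
    for d in range(max_disparity + 1):
        lo = x - d - half_width
        if lo < 0:
            break
        scores[d] = ncc(target, right[lo:lo + len(target)])
    return max(scores, key=scores.get)


left = [3, 7, 1, 9, 4, 12, 6, 2, 15, 8, 5, 11, 0, 14, 10, 13, 3, 9, 1, 7]
shift = 3
# Every feature appears `shift` samples further left in the right-hand scanline.
right = left[shift:] + [0] * shift
```

With these scanlines, `best_disparity(left, right, 10, 2, 5)` recovers the shift of 3. The example also hints at the weakness noted above: in a photometrically uniform region every window has zero variance, the correlation score is undefined, and a feature-based method would be the better choice.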

Major  improvements  of  the  feature-based 
approaches  should  result  from  increasing  the  number 
of  semantic  labels  in  the  classification  of  edges 
to  include  shadows,  occluding  contours,  and  changes 
in orientation of a photometrically uniform
surface, as well as from using the physical
constraints between labels to guide the matching
process (relaxation schemas will be used here).
Similarly, improvements in area-based (correlation)
approaches should result from combining stereo
anomaly detection with the matching process to do a
more effective job of geometrically shaping and
warping the correlation windows (e.g., avoid having
a correlation window overlap an occlusion edge).

An approach to the integration of feature- and
area-based stereo techniques is to first decompose
the images into regions according to local
properties that affect the performance of the
various matching techniques. For instance, regions
that are photometrically uniform and have distinct
boundaries are obvious choices for edge-based
stereo techniques, whereas highly textured areas
might be better suited to area-based methods.

Stereo  Modeling 

In real-world scenes, which contain such
complex 3-D structures as groups of buildings,
trees, and other elevated objects, there will
always be portions of the imagery that cannot be
matched because of occlusion, excessive perspective
distortion, and temporal changes in the scene.
These areas will be referred to as stereo matching
anomalies. In many cases the existence of stereo
matching  anomalies  can  provide  a  cue  to  the 
presence  of  raised  objects.  Our  research  goal  is 
to  classify  these  stereo  anomalies  using  supporting 
evidence  from  other  cues,  such  as  shadows  and 
monocular  features. 

We are considering a number of possible
approaches to raised-object cueing. The existence
of a stereo anomaly might trigger a search to find
a shadow consistent with the illumination geometry
and the geometry of the underlying terrain.
Alternatively, knowledge of the illumination
geometry and the underlying terrain might be used
to guide a highly selective, monocular search for
shadow edges, which in turn would guide a search
for anomalies in stereo matching. In addition to
the above, we are also developing techniques for
recognizing the presence of occlusion edges in
single black and white images that could provide
guidance for the stereo modeling system.

APPENDIX

ARPA/DMA  IMAGE  UNDERSTANDING  TESTBED  PLAN 

SUMMARY:  We  present  a  plan  for  the  activities 
supporting  the  ARPA/DMA  Image  Understanding 
Testbed.  A  detailed  schedule  is  given  for  the 
second  half  of  FY81,  together  with  an  outline  for 
the  FY82  activities.  The  testbed  efforts 
concentrate  on  three  main  areas:  1)  the  development 
of  a  software  and  hardware  environment  to  support 
the  requirements  of  the  contributed  software 
modules; 2) acquisition and integration of the
contributions;  and  3)  demonstration,  testing  and 
evaluation  of  the  image  understanding  research 
contributions. 

I  OVERVIEW  OF  THE  IMAGE  UNDERSTANDING  TESTBED


A.  Background 

ARPA has for many years been a primary sponsor
of basic research in computer vision. This support
was consolidated in 1975 into a broadly based
research program in image understanding (IU), the
goal of which was to explore fundamental computer
vision techniques that could be applicable to
military image interpretation tasks.

The IU research program has now produced a
substantial body of results that have considerable
potential for near-term application. To provide a
framework for demonstrating some of these
capabilities, ARPA and DMA have agreed jointly to
support development of a demonstration system, the
ARPA/DMA Image Understanding Testbed. While the
immediate goal of the testbed is to explore
applications of direct relevance to automated
cartography, the ultimate results may be of
interest to a far larger community.


B.  Progress 

SRI  International  was  selected  as  the 
integrating contractor to implement the testbed
system.  Over  this  past  year  the  SRI  vision  group 
has  carried  out  the  definition,  planning,  and 
facility  implementation  tasks  of  the  testbed 
design.  A  basic  configuration  of  hardware  and 
software  has  been  established  that  now  forms  the 
core  of  the  testbed  facility.  In  the  following 
paragraphs  we  describe  the  hardware,  system 
software,  and  language  utilities  currently 
available  to  support  testbed  activities. 


1.  Hardware  Configuration 

The principal elements of the current
testbed hardware configuration are a DEC VAX-11/780
central processing unit and a DeAnza
refreshed-raster-scan display system.

The  DEC  KL-10  central  processing  unit  is 
also  accessible  from  the  testbed  VAX,  using  a 
teletype  line  for  communication  and  a  shared  disk 
drive  for  file  transfer.  Programs  on  the  KL-10  can 
access  the  DeAnza  display  system  through  the  VAX  as 
an  intermediary  employing  this  same  communication 
system. 

The VAX is a four-megabyte system with
one tape drive, two RP06 disk drives (one shared
with the KL-10), and 16 teletype lines. The VAX
interfaces directly with a variety of terminals, a
digitizing table, a menu tablet, and a DEC
PDP-11/34 minicomputer that controls the DeAnza
display system directly.

The DeAnza refreshed-raster-scan display
system has a resolution of 512 x 512, with 32 bits
of information per pixel. Eight bits each are
allocated to red, green, and blue data; in
addition, there are eight overlay planes. Each
group of eight bits has its own lookup table. The
output of the four lookup tables can be combined in
a variety of ways to provide input to the system's
eight DACs. Ordinarily, three DACs drive one RGB
monitor, three drive a second color monitor, and
the remaining two DACs drive two monochrome
monitors. However, a special crossbar arrangement
has been designed and built to allow the DeAnza bit
planes to be allocated dynamically among our two
color monitors and up to eight monochrome monitors
located with terminals throughout the site. All
DeAnza graphics operations are carried out by the
PDP-11/34 under the direction of the VAX.
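
The pixel organization described above can be illustrated with a short sketch. The Python below only models the bit allocation and the per-group table lookup; the order of the fields within the 32-bit word, the function names, and the example tables are assumptions made for illustration and are not details of the DeAnza hardware.

```python
# Illustrative sketch only: the field order (red in the top byte, overlay in
# the bottom byte) and all names here are assumptions, not DeAnza specifics.

def pack_pixel(red, green, blue, overlay):
    """Pack four 8-bit fields into one 32-bit pixel value."""
    for value in (red, green, blue, overlay):
        if not 0 <= value <= 255:
            raise ValueError("each field must fit in 8 bits")
    return (red << 24) | (green << 16) | (blue << 8) | overlay


def unpack_pixel(pixel):
    """Split a 32-bit pixel into (red, green, blue, overlay)."""
    return ((pixel >> 24) & 0xFF, (pixel >> 16) & 0xFF,
            (pixel >> 8) & 0xFF, pixel & 0xFF)


def apply_lookup_tables(pixel, tables):
    """Pass each 8-bit field through its own 256-entry lookup table."""
    fields = unpack_pixel(pixel)
    return tuple(table[field] for table, field in zip(tables, fields))


identity = list(range(256))               # leaves a field unchanged
inverted = [255 - v for v in range(256)]  # photographic negative

pixel = pack_pixel(10, 20, 30, 1)
outputs = apply_lookup_tables(pixel, [identity, inverted, identity, identity])

# Frame memory implied by the stated resolution and pixel depth, in bytes.
frame_bytes = 512 * 512 * 32 // 8
```

At the stated resolution and depth, one full frame occupies 512 x 512 x 4 bytes, that is, 1,048,576 bytes.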


2.  Operating  System  Facilities 

The  testbed  system  runs  under  the  UNIX 
operating  system,  which  is  currently  available  on 
the  SRI  VAX  through  an  interfacing  software  package 
(developed  at  SRI)  called  "EUNICE." 

The KL-10 presently associated with the
testbed VAX runs under the TOPS-20 operating
system; this facility is available to run
application programs developed prior to the
initiation of the testbed effort.


3.  Language  Support 

High-level programming languages
currently available on the testbed VAX are
MAINSAIL, FRANZLISP, and C. Extensive graphics
functions are currently available from MAINSAIL,
which is a high-performance ALGOL-like language
similar to the SAIL language available on PDP-10
computers ("MAINSAIL" is derived from "MAchine
INdependent SAIL"). MAINSAIL is currently
available on the PDP-10 under TOPS-20 and on the
VAX under either VMS or UNIX. FRANZLISP, a variant
of LISP, was developed at U.C. Berkeley; it is
written almost entirely in C and is intended to run
with the UNIX operating system (it runs under
EUNICE on the testbed VAX). The testbed graphics
capabilities are currently being extended to
include C and FRANZLISP.

The  languages  MAINSAIL,  SAIL,  and  MACLISP 
are supported on the KL-10. Graphics functions are
accessible  by  using  MAINSAIL  on  the  KL-10  to  send 
directives  through  the  VAX. 


C.  Planned  Activities 

Activities designed to fulfill the goals of
the Image Understanding Testbed project will fall
into three major categories: development of system
features and documentation; acquisition and
integration of contributed image understanding
modules; and the demonstration, testing, and
evaluation of the image understanding research
contributions.

The testbed hardware and software
configuration will be enhanced to establish a
powerful, self-contained system for demonstrating
the results of image understanding research. An
effort will be made to configure the system so that
similar systems could easily be duplicated at other
sites. A full set of documentation describing the
testbed configuration and its use will be
generated. The documentation will include a user's
manual, a description of the hardware and software
of the testbed environment, a catalog of the
supported contributions, and a catalog of the
available test imagery. The documentation for the
basic utility packages will establish
community-wide standards for the image
understanding software environment in C,
FRANZLISP, and MAINSAIL.

Image  understanding  modules  will  be  proposed 
by  each  of  the  IU  research  groups  and  evaluated 
with  respect  to  their  suitability  for  the  testbed. 
Those  modules  that  are  appropriate  for  a  particular 
stage  of  development  will  be  acquired.  The  support 
structure  required  by  a  given  module  will  be 
analyzed and the module integrated into the testbed
system. 


The demonstration, testing, and evaluation
procedures for contributed modules are currently
being developed. As an initial step, we are
undertaking a study of the fundamental functional
areas and paradigms of IU research. The first such
area being studied is stereographic systems. A
paper on the state of the art of IU stereo modules
is being prepared and will be presented at the
April 1981 Image Understanding Workshop. As the
testbed progresses, analyses of various IU areas
and plans for appropriately demonstrating their
capabilities will be developed regularly.

The testbed program will be implemented
primarily by a team consisting of the testbed
coordinator, the testbed system programmer, and a
person with a vision research background who will
play a major role in the integration of contributed
modules. These personnel will be assisted by
members of the SRI vision research group, as well
as by consultants from testbed contributor sites.
In particular, research personnel are expected to
play a substantial role in formulating the test and
evaluation plans and procedures for contributed
modules.


II  PROJECT  PLAN  FOR  THE  SECOND  HALF  OF  FY81 


A.  Overview 

Now that a basic testbed system concept has
been established, the testbed must evolve to
accommodate the requirements of the software
contributions anticipated from the other testbed
participants. The proposed contributions must be
examined and selections made. Consulting
assistance by the original authors must be made
available, if possible, to help identify the
critical issues involved in adapting each
contribution to the testbed environment. As the
list of nominal contributions is developed, they
will be brought into the testbed and integrated
into its environment. The knowledge required to
carry out the module integration task is still
being acquired. Thus, the actual instantiation of
the testbed will probably differ somewhat from the
plans presented here.

While the contributions are being integrated,
we shall also be evaluating the state of the art
in selected areas of image understanding. This
activity will provide an essential basis both for
the decisions to acquire specific modules and for
the later task of evaluating the image
understanding research results. The ultimate goal
is to develop a comprehensive picture of the nature
and capabilities of the major areas of the image
understanding research program.

Table  1  contains  a  list  of  the  major 
anticipated  activities  and  milestones  of  the 
Testbed  project. 


Table 1

TESTBED ACTIVITIES AND MILESTONES

(-- = ongoing activity)

ACTIVITY                                  BEGIN         END

SYSTEM DEVELOPMENT
  <software>
    Operating System Environment          In progress   --
    Graphics and Image Access Utilities   2/81          7/81
    User Interface                        4/81          1/82
  <hardware>
    Image Scanner                         12/80         3/81
    Display                               3/81          7/81
    Lisp Machines                         8/81          10/81
    Other Hardware                        3/81          11/81
  <documentation>
    Plans and Presentations               11/80
    System Environment                    11/80         --
    Graphics and Image Access Standards   6/81          8/81
    User's Manual                         8/81          10/81
    Contribution and Image Catalog        6/81          --

ACQUISITION AND INTEGRATION OF CONTRIBUTIONS
    Definition                            In progress   --
    Evaluation and Selection              In progress   --
    Acquisition                           3/81          --
    Testing and Modification              3/81          --
    Integration                           5/81          --
    Test Image Acquisition                In progress   --

DEMONSTRATION, TESTING, AND EVALUATION OF CONTRIBUTIONS
    Develop Scientific Overview           1/81          4/82
    Establish Test and Evaluation Plans   6/81          4/82
    Demonstration of Contributions        6/81          --
    Evaluation and Comparison Tasks       10/81         --


B. Development of System Software, Hardware, and
   Documentation


The testbed system environment will be
extended in a variety of ways during the remainder
of this funding period. The purpose of these
enhancements will be to provide support to
facilitate evaluation of the contributed image
understanding modules. The anticipated testbed
system activities during the rest of FY81 may be
grouped into three subcategories: software tasks,
hardware tasks, and documentation tasks. These
tasks will be planned by the testbed coordinator
and carried out by all testbed personnel.


The IU modules contributed to the testbed
will be written in a variety of languages and will
be designed for several different environments.
The testbed development plan is centered on a
UNIX-based environment supporting the languages C,
FRANZLISP, and MAINSAIL. A number of system
software tasks will be undertaken to support the
contributed modules. For example, we shall
establish a set of basic software
intercommunication capabilities; the objective is
to make it very easy for programs written in
different languages, such as C and FRANZLISP, to
function cooperatively in an interactive
environment. Communication facilities will also be
established to allow interaction with other
computer systems, such as the KL-10 and LISP
Machines. Improvements in basic system utilities
will be made as the need arises.

A software documentation system will be
established that provides a framework for creating
documentation in a standardized format. The
documentation created in this fashion will be
entered into a database and will be retrievable
by a suitable documentation access system.

GRAPHICS  AND  IMAGE  ACCESS  UTILITIES 

A standard set of graphics and image
access utilities forms a critical central part of
the testbed environment. These are the tools that
enable the rest of the IU modules to work together
in an efficient way. Parallel sets of graphics
utilities with essentially the same capabilities
will be written in C, in FRANZLISP, and in MAINSAIL
to allow access to the graphics in all supported
testbed languages. Image file access will be
supported by parallel utilities in MAINSAIL, C, and
FRANZLISP, so that image files can be accessed and
manipulated in all of the supported testbed
languages. The identification, labeling, and
retrieval of image data will be supported by an
image retrieval database system. We are
considering modeling the image database system on
the HAWKEYE project previously undertaken at SRI.

USER  INTERFACE 

The testbed will support a standard
interactive user interface to allow demonstration
of and experimentation with all of the testbed
capabilities. The user interface will be a
powerful tool enabling exhaustive testing and
evaluation of the contributed modules. The testbed
user interface is currently envisioned as a
LISP-based set of interactive utilities that can be
invoked either in immediate execution mode or from
a stored executive file. Prearranged
demonstrations can be assembled in the form of a
deferred execution file. Experimental evaluation
can be carried out in the interactive mode. Where
appropriate, keyboard command entry will be
supplemented by the option of entering commands and
command parameters by means of a pointing device,
such as a digitizing tablet. For example, a tablet
would naturally be used for pointing to delimiting
boundaries inside an existing display while a user
was defining a new, magnified display.
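
The two invocation modes described above, immediate execution and execution from a stored file, can be sketched as follows. This is a minimal illustration in Python rather than LISP; the command names, the argument format, and the use of ";" to mark comment lines are all hypothetical, since the plan does not specify them.

```python
# Hypothetical sketch: command names and argument formats are invented for
# illustration; the plan does not specify them.

def display(name):
    return "displaying " + name


def magnify(name, factor):
    return "magnifying " + name + " by " + str(int(factor))


COMMANDS = {"display": display, "magnify": magnify}


def run_command(line):
    """Immediate execution: parse one command line and run it."""
    words = line.split()
    if not words:
        return None
    name, args = words[0], words[1:]
    if name not in COMMANDS:
        raise ValueError("unknown command: " + name)
    return COMMANDS[name](*args)


def run_deferred(lines):
    """Deferred execution: run every non-blank, non-comment line in order."""
    results = []
    for line in lines:
        stripped = line.strip()
        if stripped and not stripped.startswith(";"):
            results.append(run_command(stripped))
    return results


# A prearranged demonstration is just a stored list of the same commands.
demo_script = ["; prearranged demonstration",
               "display scene1",
               "magnify scene1 4"]
demo_results = run_deferred(demo_script)
```

Because both modes go through the same `run_command` function, anything that can be typed interactively can also be placed in a stored demonstration file without change.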

The  contributions  of  image  understanding 
research software would be organized into a
documented  database  structure  of  their  own.  This 
system  would  contain  within  it  predefined 
demonstrations  of  each  module's  capabilities. 


2.  Hardware  Acquisition  And  Integration 

Additional hardware will be acquired to
enhance the present capabilities and to support all
the requirements of the testbed. The testbed
hardware will be integrated into the system,
together with the required software support and
utilities necessary to make use of it. Listed
below are the major items of testbed hardware to be
incorporated into the system during the coming
year.

HIGH-RESOLUTION  IMAGE  SCANNER 

The testbed will acquire and integrate a
high-resolution optical scanner for the purpose of
digitizing film images. The scanner's basic
capabilities will be to digitize film images of up
to 10 x 10 inches at a resolution approaching that
of the film grain, approximately 10 microns. A
high level of geometric precision will be provided.
The photometric range will include 8 bits of
monochrome and up to 8 bits each for red/green/blue
scans of color imagery. The scanner will support a
variety of functions. First, it will allow testbed
visitors to bring their own test data in the form
of film imagery, digitize the data on the spot, and
try a given module on their own data. Second,
the scanner will provide a vital opportunity for
experimenting with imagery digitized under
controlled conditions; testbed algorithms can then
be evaluated with respect to their sensitivity to a
variety of factors in the image digitization
process. Finally, the scanner will support the
geometric precision and photometric accuracy
required to carry out stereo mapping functions,
shadow identification, and raised-object
identification; previous vidicon-based digitizing
systems have proven completely inadequate for these
purposes.
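
The data volume implied by these scanner figures can be estimated with a short calculation. The sketch below assumes a 10-micron sampling interval in both directions, one byte per 8-bit sample, and no compression or padding; the function name is invented for illustration.

```python
# Back-of-envelope sketch; assumes a 10-micron sampling interval in both
# directions and one byte per 8-bit sample, with no compression or padding.

MICRONS_PER_INCH = 25400


def scan_size(width_in, height_in, spot_microns, bytes_per_pixel):
    """Return (columns, rows, total_bytes) for a full-frame scan."""
    columns = width_in * MICRONS_PER_INCH // spot_microns
    rows = height_in * MICRONS_PER_INCH // spot_microns
    return columns, rows, columns * rows * bytes_per_pixel


mono = scan_size(10, 10, 10, 1)    # 8-bit monochrome
color = scan_size(10, 10, 10, 3)   # 8 bits each for red, green, and blue
```

Under these assumptions a full 10 x 10 inch monochrome scan is 25,400 x 25,400 samples, or about 645 million bytes, and a three-band color scan is about 1.9 billion bytes.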

GRAPHICS  DISPLAY  SYSTEM 

A refreshed-raster-scan graphics display
system compatible with a majority of the testbed
contributors' systems will be added to the testbed
system. The graphics system will extend the
capabilities of the testbed graphics, so that
existing contributed software can be integrated
more easily into the system. Since the current
testbed graphics system is a specially modified
research system developed at SRI, it will be highly
beneficial to the testbed to have a standardized
system of its own that can be easily duplicated and
maintained. One of the tentative goals of the
testbed project is to develop a transportable and
universally duplicable system with capabilities
adequate for all image understanding tasks. The
enhanced graphics system plays a critical role in
allowing the testbed configuration to meet these
goals.

LISP  MACHINES 

A number of highly desirable image
understanding contributions are coded in
LISP-Machine LISP and run properly only on the MIT
"CADR" LISP machines. To take full advantage of
the capabilities of the MIT contributions and to
accommodate later contributions that may be coded
in LISP-Machine LISP, it would be highly desirable
to install at least two LISP Machines as part of
the testbed hardware configuration. These machines
would operate as independent processors for
intensive LISP calculations, thus relieving the VAX
of the substantial processing load required to
service such computations. The LISP machines would
communicate with the VAX via the Ethernet
high-bandwidth network, using existing software
that is available for the LISP Machines.

OTHER  HARDWARE 

The testbed will acquire and integrate
three new high-capacity disk drives. The
additional disk drives will be used to replace the
shared drive that is being lost and to increase the
system's capacity to support a substantial number
of users and their test imagery.

A  high-quality  film  hard-copy  device  for 
photographing  intermediate  and  final  results  of 
image  understanding  computations  would  be  a  highly 
desirable  addition  to  the  testbed  hardware  system. 
Consequently, it will be added to the system
whenever  appropriate  funding  is  available. 

Communication  with  other  computer  systems 
and  peripherals  will  be  essential  to  satisfactory 
operation  of  the  system.  We  have  already  begun 
installing an Ethernet network on the Unibus of the
testbed  VAX.  High-speed  communication  channels 
will  be  established  using  the  Ethernet  to 
communicate  with  other  VAXs,  the  KL-10,  the  2060, 
and  the  LISP  Machines. 


3.  Documentation 

The  establishment  of  a  full  set  of 
documentation  describing  all  aspects  of  the  image 
understanding  testbed  system  will  be  a  critical 
part  of  the  testbed  project.  Among  the  specific 
documentation tasks that must be included in the
testbed  plan  are  the  generation  of  testbed 
proposals,  plans,  and  reports,  the  preparation  of 
informational  presentations,  descriptions  of  the 
testbed  hardware  and  software  environment,  and 
descriptions  of  both  the  scientific  and  user- 
related  properties  of  the  contributed  testbed 
modules. 

PROJECT  DOCUMENTATION

PROJECT  PLANS 

Project plans detailing the anticipated
development of the testbed will be prepared as
required. The plan presented in this document is
the first detailed plan attempted for the testbed
project as a whole. As the testbed evolves and
changes, the project plan will be updated
accordingly.

PROJECT  PRESENTATIONS

Periodically the status of the testbed
must be formally presented to such groups as the
Testbed Steering Committee and the IU principal
investigators. These project presentations will be
developed as required to communicate information
regarding testbed activities to concerned parties.

SYSTEM  DOCUMENTATION

HARDWARE  ENVIRONMENT 

A succinct but comprehensive description
of the present testbed hardware configuration is
now available. This document will be expanded and
updated as the testbed hardware configuration and
critical information about the configuration are
changed. Further diagrams and details of the
system hardware usage will be added to improve the
utility of the existing descriptions.

SOFTWARE  DEVELOPMENT  ENVIRONMENT 


A brief description of the present
testbed system software development environment is
currently available. As additional utilities and
support software become available or are generated
for the testbed, a more comprehensive documentation
scheme for the testbed will be established. An
indexed set of documentation describing the basic
software characteristics of the testbed environment
will be generated and periodically updated when
appropriate.

GRAPHICS  STANDARDS 

All graphics manipulation functions will
be available in three alternative environments:
MAINSAIL, C, and FRANZLISP. A set of documents
describing the available testbed graphics
functions, their parameters, and their usage will
be generated as an integral part of the testbed
system. The MAINSAIL documentation is in progress,
while the C and FRANZLISP documentation will be
generated concurrently with the relevant utility
software.

So far as is practical, a universal set
of testbed graphics standards will be created that
all contributions will use for graphics access.
The testbed graphics function documentation will
establish this de facto standard.

IMAGE  ACCESS  STANDARDS 

All image access functions will be
available in three alternative environments:
MAINSAIL, C, and FRANZLISP. A set of documents
describing the available testbed image access
utilities, their parameters, and their usage will
be generated as an integral part of the testbed
system. The MAINSAIL documentation is in progress,
while the C and FRANZLISP documentation will be
generated concurrently with the software.

So far as is practical, a universal set
of testbed image access standards will be created
that all contributions will use for image access.
The testbed image function documentation will
establish this de facto standard.

USER'S  MANUAL 

The testbed user interface system will be
described in a detailed testbed system user's
manual. The user's manual will contain a brief
overview of the testbed system hardware and
software configuration, together with instructions
for using the entire system from within the user
interface shell. The methods and utilities
available for carrying out testing and evaluation
procedures will be described in detail. Basic
instructions for using the program development
environment will be supplied together with pointers
to more complete documentation. The user's manual
will also indicate how to use the on-line
documentation to obtain more information about
specific contributions and their usage within the
testbed context.

TESTBED  CATALOGS

CONTRIBUTION  CATALOG 

The capabilities and image understanding
research modules available on the testbed will be
fully documented. A contribution catalog will be
established on line to allow quick access to
essential documentation describing the use of the
contributed modules. Each contribution will be
described both in terms of its scientific content
and its practical usage. A series of pointers will
be set up among related routines, as well as among
those that can be used cooperatively in some
fashion.

TEST  IMAGE  CATALOG 

The test image catalog will be an
essential part of the testbed system's on-line
documentation. It will contain both human-readable
pointers to test data falling into various
categories and machine-readable pointers that
enable the user to select particular types and
instances of test data with minimum knowledge about
the explicit file structures involved. The test
image catalog will work in concert with the test
image database system to form a total environment
that is tentatively based on the SRI Hawkeye image
database system.
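
One way to picture the pairing of human-readable and machine-readable pointers is the sketch below. It is a hypothetical illustration only: the field names, categories, and file names are invented, and nothing here describes the actual Hawkeye database.

```python
# Hypothetical sketch: the field names, categories, and file names below are
# invented for illustration and do not describe the Hawkeye system.

CATALOG = [
    {"name": "urban-1", "category": "stereo pair",
     "description": "Aerial stereo pair of an urban scene",
     "files": ["urban1_left.img", "urban1_right.img"]},
    {"name": "rural-1", "category": "stereo pair",
     "description": "Aerial stereo pair of rolling terrain",
     "files": ["rural1_left.img", "rural1_right.img"]},
    {"name": "roads-1", "category": "linear features",
     "description": "Single aerial image with a road network",
     "files": ["roads1.img"]},
]


def select(catalog, category):
    """Machine-readable access: return all entries in one category."""
    return [entry for entry in catalog if entry["category"] == category]


def describe(entry):
    """Human-readable pointer: a one-line summary of an entry."""
    return entry["name"] + ": " + entry["description"]


def files_for(catalog, category):
    """Resolve a category to file names without exposing the file layout."""
    return [name for entry in select(catalog, category)
            for name in entry["files"]]


stereo_files = files_for(CATALOG, "stereo pair")
summary = describe(CATALOG[2])
```

A user who asks for a category receives the matching files directly, so the underlying file structure can change without affecting the modules that consume the data.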


C.  Acquisition  and  Integration  of  Contributions 

A major portion of the testbed effort over
the next year and a half will be devoted to
acquiring image understanding contributions from
the other testbed contributors at Carnegie-Mellon,
MIT, Stanford, Rochester, Maryland, and the
University of Southern California. After the
acquired modules have been integrated into the
testbed system, they will be tested, demonstrated,
and evaluated to weigh their technical merits. The
sequence of procedures we plan to follow while
adding contributed modules to the testbed system is
listed below. The acquisition and integration of
contributed modules will be planned by the testbed
coordinator and carried out by testbed personnel
with substantial assistance from home site
consultants.

DEFINITION 

Contributors have been asked to describe their
proposed modules in detail. This information
should give testbed planners a sufficiently
accurate picture of the capabilities and degree of
completeness of the proposed modules.

SELECTION 

Proposed modules that both the contributing
organization and the testbed staff consider
potentially suitable for inclusion in the testbed
will be evaluated more intensively. Most or all
such modules will be demonstrated at their home
site to determine their capabilities, degree of
portability, and desirability. Possible problems
in transferring the modules to the testbed
environment will be discussed in detail.
Qualifying modules will then be selected for
inclusion in the testbed system.

The qualification and selection procedure will
be carried out in stages. Small, simple modules
will be preferred at first. As the testbed
software environment is enhanced to accommodate the
requirements of the initial modules, it will become
easier to support more complex contributions. In
addition, modules that were still incompletely
implemented during the early stages of the module
acquisition process will gradually be completed.
As this is accomplished, additional contributions
will be qualified for inclusion in the testbed.

ACQUISITION 

Modules that are selected for inclusion in the
testbed at any given stage will be prepared by the
contributor for transport to the testbed. In many
cases, this preparation will be sufficiently
difficult that software consultants familiar with
the contribution and its structure may visit the
testbed site in person to assist in making that
contribution operational. Acquisition of a given
contributed module will include as much support
software as is practical.

TEST  AND  MODIFY 

Acquired image understanding modules will be
tested and modified to fit into the testbed
environment. When necessary, consultants from the
respective module development sites will be
employed to carry out this procedure efficiently.
This activity is the essential prerequisite to
integrating the contribution into the testbed
system.


INTEGRATION 

When a contributed module is sufficiently well
understood so that its entire support structure has
been properly implemented in the testbed
environment, the module can be integrated into the
testbed system itself. At this stage each
contribution will be added to the list of processes
that can be interactively invoked and experimented
with on the testbed. Appropriate ways of passing
data to the module from the interactive environment
will be established and documented and, when
desirable, interfaces to other modules will be put
in place. One or more test cases will be set up to
demonstrate the capabilities of each module.
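
The five procedures above form a fixed sequence that each contributed module passes through, which can be sketched as a simple status tracker. The stage names follow the headings in this section; the tracking scheme itself and the module name "edge-finder" are assumptions made for illustration.

```python
# Sketch only: the stage names follow the headings above, but the tracking
# scheme and the module name "edge-finder" are assumptions for illustration.

STAGES = ["definition", "selection", "acquisition",
          "test and modify", "integration"]


def advance(stage):
    """Return the stage that follows `stage`, or raise if already integrated."""
    index = STAGES.index(stage)
    if index == len(STAGES) - 1:
        raise ValueError("module is already integrated")
    return STAGES[index + 1]


def is_demonstrable(stage):
    """Only fully integrated modules join the interactive demonstrations."""
    return stage == "integration"


status = {"edge-finder": "definition"}
for _ in range(4):
    status["edge-finder"] = advance(status["edge-finder"])
```

Keeping the stage explicit for each module makes it straightforward to list which contributions are ready for demonstration at any point in the staged acquisition process.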


D. Demonstration, Testing, and Evaluation of
   Contributions

The primary scientific task of the testbed is
the evaluation of the contributed image
understanding research modules in a uniform
context. This task will be broken down into a
number of distinct stages. SRI testbed and
research staff will cooperate in developing a
scientific overview of the state of the art of
image understanding research. From this work will
come a more detailed plan for testing and
evaluating the contributed modules. The actual
demonstration, testing, and evaluation of the image
understanding research contributions will be based
on that plan and will be carried out primarily by
the testbed personnel.

SCIENTIFIC  OVERVIEW 

A scientific overview of the state of the art
of image understanding research will be developed
with the participation of image understanding
researchers from the community of testbed
contributors. The basic IU paradigms will be
identified, the nature of the IU research
contributions to each area described, and
appropriate test domains noted. This process will
result in a series of scientific papers reviewing
and evaluating the status of the major IU
functional areas. The following areas have
tentatively been identified for this activity:

*  Stereographic  reconstruction 

*  Linear  feature  analysis 

*  Image  matching 

*  Pattern  recognition  and  segmentation 

*  Miscellaneous  (including  sensor  prediction, 
texture  analysis,  shape  from  shading,  etc.) 

ESTABLISHING  THE  TEST  AND  EVALUATION  PLAN 

As the scientific overview is developed, we
shall formulate a strategy for testing and
evaluating the IU contributions in each of the
major technical areas. A plan will be proposed for
carrying out the evaluation procedure. Testing
methods will be proposed and specific evaluation
measures suggested. Appropriate domains of test
data will be determined for each of the functional
areas, as well as for the corresponding
contributions.

DEMONSTRATION  OF  CONTRIBUTIONS 

When a contributed image understanding module
has been fully integrated into the testbed system,
it will be added to the set of testbed
demonstration packages. Facilities will be
established to use the module not only with special
test cases, but also with any set of test data the
user wishes to try. Home site consultants may
again be used at this time to optimize the quality
of the demonstrations.

EVALUATION  AND  COMPARISON 

The final step in the testbed program will be
to evaluate completely integrated image
understanding modules according to the test and
evaluation plan. When possible, similar modules
will be compared with one another. Further work
will be done to determine which measures are
appropriate for the evaluation of particular
modules, which image domains are appropriate, and
which imagery is representative of these domains.
All conclusions and results of the evaluation
procedure will be fully documented.


III PROJECT PLAN FOR FISCAL YEAR 1982


The major testbed system tasks during the
second half of FY81 involve improvements in the
testbed environment, integration of contributed
software, and the establishment of a plan for
evaluating contributed modules. Additional
hardware is scheduled to be incorporated into the
system and the software utilities will be
substantially extended. An initial set of image
understanding research contributions will be
acquired and the testbed environment prepared to
support them as they are integrated into the
system. The state of the art of image
understanding research will be determined and a
coordinated plan for testing and evaluating the
contributed modules will be formulated.

The  specific  tasks  to  be  carried  out  in  FY82 
depend  largely  on  the  manner  in  which  the  project 
develops  during  the  remainder  of  FY81.  However, 
many  of  the  activities  anticipated  for  FY82  will 
clearly  be  extensions  of  those  begun  earlier. 

Tasks started in FY81 will be further defined and
brought to completion in FY82. Segments of the
system will be redesigned and improved when
necessary. Obviously, the system requirements will
have to be redefined as the testbed staff and users
gain experience with the system.

The general thrust of testbed activities in
FY82 will be to acquire, integrate, and evaluate a
substantial additional number of contributed image
understanding research results and to complete the
testbed contribution evaluation program. As the
library of image understanding software supported
by the testbed grows, the scientific overview and
the evaluation plan will be extended and modified.
Additional scientific reports will be generated to
evaluate the state of the art of the image
understanding field and to point out areas of
strength and weakness. These results will help to
clarify the particular modules that should be
supported by the testbed, as well as to pinpoint
areas in which additional work and research would
be warranted.

The tasks involved in the demonstration,
testing, and evaluation of contributed modules
(described in the preceding section) will be
continued and extended. In particular, strategies
will be developed for demonstrating the
capabilities of each module and the modules will be
tested and evaluated with respect to their relative
performance and scientific merits. The principal
contents of the final report relating to testbed
activities will be a detailed description of the
results of the evaluation procedure.


IV  STATUS  AND  PLANS  FOR  SPECIFIC  CONTRIBUTIONS 


The purpose of this section is to present a
summary of the specific image understanding
research contributions that are expected to become
a part of the testbed system. Each contributor has
provided the testbed staff with a list of proposed
modules and a description of their characteristics.
These specific descriptions of what each
contributor could provide have been collected in a
computer file for reference. While the contents of
this lengthy file will not be included here, we
plan to compile an appropriately edited version
containing a full description of each contribution
actually received by the testbed.

Each contributor has been contacted and asked
to furnish information regarding the most
appropriate initial contribution from that
contributor's site. The proposed contributions are
being evaluated and are being prepared by the
contributors for transport to the testbed. The
contributors have also provided information on the
contributions they anticipate having ready before
the end of FY82. In Table 2 we summarize the
presently planned contributions to the testbed from
each participant. The degree of the contributor's
commitment to providing each given module is
indicated, along with approximate dates for the
prospective acquisition of the module and its
integration into the testbed system.

The preferable form of the delivered
contributions would be in one of the languages
supported on the testbed VAX. Initial
contributions will be analyzed from the standpoint
of their support requirements and an appropriate
set of graphics and image access standards will be
formulated. While several attempts to define
testbed standards have already been made, the
severe difficulties encountered in establishing
ideal standards have indicated that they are best
developed in an iterative fashion on the basis of
existing software. Thus, the initial contributions
will help define the standards to be followed in
later contributions and in later versions of the
initial contributions. We shall depend upon some
support from each contributing institution to aid
in fitting the contributions into a uniform
environment.

                              Table 2

       SUMMARY OF CONTRIBUTIONS AND CONTRIBUTION SCHEDULE

CONTRIBUTION                  COMMITMENT  ACQUISITION        INTEGRATION

SRI
  Road Expert                 firm        complete           5/81
  RANSAC                      firm        complete           5/81

CMU                                       (1st | 2nd         (1st | 2nd
                                          version)           version)
  Segmentation                firm        4/81 | 9/81        6/81 | 12/81
  Intervisibility             likely      6/81 | 9/81        9/81 | 12/81
  Stereo (Moravec)            likely      8/81 | 12/81       ?/81 | 3/82

STANFORD
  [Stereo - see CMU]          [firm]      [3/81]             [3/82]
  Camera Solver and
    3D Obstacle Finder
    (Gennery)                 likely      6/81               10/81
  Line Finder                 likely      6/81               11/81
  ACRONYM                     likely      4/82               6/82

MARYLAND
  Relaxation                  firm        3/81               5/81
  Interactive Segmentation
    Package                   possible    9/81               12/81

ROCHESTER
  Hough Transform             firm        6/81               8/81
  Strip Trees                 possible    10/81              12/81

MIT
  [Require MIT LISP Machine]  likely      8/81               10/81
  Stereo (Marr)               "           "                  "
  Shape-from-Shading          "           "                  "

USC
  Linear Features             likely      available in SAIL  9/81
  Laws' Texture Analysis      likely      available in SAIL  12/81
  Image-to-Map
    Correspondence            likely      available in SAIL  4/82

The scientific context of each of the
contributed modules will be defined by the
scientific overview papers mentioned in Section
II-D. The evaluation procedures and criteria to be
used in the scientific assessment of each module's
capabilities will be derived from the basic
material in those papers. The appropriate data
sets to be used in testing the contributions will
also be characterized in the overviews. The final
output of the testbed will be a systematic
evaluation and characterization of the merits of
each contribution, together with an environment in
which the contributions can be flexibly tested and
demonstrated.


V CONCLUDING REMARKS

The image understanding testbed will establish
a standard environment in which the image
understanding research modules can be tested and
evaluated. A variety of IU research results,
implemented in the testbed system, will be
accessible for evaluation and comparison. Once in
place, the testbed will form a framework which
other concerned parties, such as the Defense
Mapping Agency, can use to adapt image
understanding research capabilities to their own
specific applications.

(a) ORIGINAL IMAGES

(b) BRIGHT POINTS IN BINARY MASK SHOW WHERE IMAGES HAVE "RIVER-LIKE" APPEARANCE

(c) RIVER DETECTION MASK OVERLAID ON ORIGINAL IMAGES

FIGURE 1 DETECTING AND DELINEATING RIVERS IN AERIAL IMAGERY


SPATIAL  UNDERSTANDING 


Thomas O. Binford


Artificial  Intelligence  Laboratory,  Computer  Science  Department 
Stanford  University,  Stanford,  California  94305 


Abstract 

The ACRONYM system has been extended greatly over
the  past  year.  It  proved  successful  in  preliminary  tests  in 
identifying  aircraft.  ACRONYM  is  domain-independent;  its 
knowledge  of  aircraft  is  in  the  form  of  models.  None  of  its 
rules  and  perceptual  mechanisms  are  specialized  to 
aircraft. 

Progress has been made in stereo vision in geometric
constraints for general surfaces and for special surfaces.
General new theoretical results on understanding line
drawings in images appear applicable to solids with
curved surfaces with surface markings and shadows,
to paper, and to wires. In an example, a three-space model
of an aircraft is built up from a single view, using general
constraints with no knowledge of the object or surfaces.
The system identifies shadows and uses them to infer the
height of surfaces. Several new constraints on
correspondence of edges have been discovered and
incorporated in stereo matching. Two results relevant to
biological vision systems are found to agree with
experiment.

Concerning  special  surfaces,  preliminary  results  are 
reported  for  monocular  constraints  on  appearance  of 
orthogonal  trihedral  vertices  in  image  sequences. 

Introduction 

Our research program focuses on the ACRONYM
system for perception and planning action in the real
world. ACRONYM integrates perceptual algorithms in a
total system which interprets images as spatial structures
in three-space. Our interest in building a total system is to
solve fundamental problems of representing geometric
structures and transformations. The effort in systems is
worthwhile, too, in providing an experimental system to
build on for further research. An essential part of the
research couples ACRONYM to real world inputs, by
developing fundamental algorithms for constructing
symbolic descriptions of images and sequences of images,
and mapping them to symbolic descriptions of surfaces
and objects.

ACRONYM is the basis for performance systems in
photointerpretation and cartography, systems which
promise to be generalizable from a few specially chosen
examples in a research environment to realistic variation
in an operational setting. Now, based on recent progress,
we see perceptual components and an integrated system
whose performance within the short term will be
striking by current standards. Defense applications of
image understanding are really quite difficult and demand
that level of performance: automating stereo compilation
for building complexes and cultural sites; automating
classification tasks of photointerpreters; guidance and
targeting. We set a short term objective of demonstrating
interesting performance in example tasks, performance
adequate to see a clear path to exploit image
understanding technology for carefully selected high
performance systems.

ACRONYM 

ACRONYM is a perceptual system with geometric
modeling and geometric reasoning in the form of a
powerful problem-solving system [Brooks 79]. Several
objectives have influenced the design of ACRONYM.
ACRONYM should be natural to program; it is programmed
from geometric models. ACRONYM should incorporate all
available knowledge and information; it maps knowledge
and data into geometric constraints on various geometric
structures of a geometric representation hierarchy.
ACRONYM should be capable of perception without
knowledge of viewpoint; it has viewpoint-independent
volume models.

In ACRONYM, objects are represented as part-whole
graphs whose primitives are generalized cylinders. One
objective of research is to provide means for representing
object classes and identifying them. Thus, ACRONYM has
models for passenger aircraft and models for L-1011 and
747 as aircraft. It has succeeded in identifying 707s as
aircraft without a model for the 707. Typical systems
identify an aircraft by attempting separate identifications
as an L-1011, 747, or 707, without the least connection
among them. In those systems, an L-1011 is no more and no
less related to a 747 than to a car or tank. Generalized
cylinders provide a means of generic representation
(object classes) as abstract structures of volumes.
ACRONYM provides class restriction and quantifiers as
further mechanisms for generic representation. A
quantifier is a variable whose value is partially
determined by a system of constraints. An L-1011 is a
restriction of the class passenger aircraft.

[Brooks 79] described a first test of ACRONYM in
identifying an L-1011 from an aerial photograph of San
Francisco airport. After a few such experiments,
extensions to ACRONYM were designed and implemented
over the course of more than a year [Brooks 80]. These
extensions introduced the major capability to extract three
dimensional information from images. [Brooks 81]
describes the first series of experiments testing the new
system. ACRONYM performed well with poor data from
small images. At this point, the major limitation is the
quality of the symbolic descriptions it receives from edge
and ribbon finding programs. We used to say that even
though edge finders are weak, we can still see a lot in
those images, "why can't computers?". That comment is
no longer valid. In this case, ACRONYM understood poor
data better than its author. Of course, he could do better
than ACRONYM with the original image, instead of the
[illegible] edge data. ACRONYM decided from shape that one
aircraft was consistent with L-1011 and two aircraft in
the [illegible] were not 747s or L-1011s. It did not use size,
although ACRONYM concluded separately that they were
too small to be 747s if the other were an L-1011.

ACRONYM had models of aircraft and a camera model
with elevation between 1000 and 12000 meters. It had
very little information about expected apparent size.
ACRONYM has no special knowledge of aerial scenes. All
its rules are about geometry and algebraic manipulation.
Its performance would be impressive if it were a special
program devoted to aircraft. It is more impressive that its
mechanisms have nothing to do with aircraft, only its
models. ACRONYM can be expected to perform well in
identifying objects such as vehicles by substituting models
for vehicles.

Extensions to ACRONYM included: generalization of
the rule language to simplify the rules and to allow rules
to manipulate object graphs and prediction graphs, in
order to make control uniform; introduction of
quantifiers, as described above; introduction of constraint
expressions involving variables; symbolic manipulation
and simplification of geometric transforms; incorporation
and extension of a mechanism to test satisfiability of
constraint sets; constraints as mechanisms for determining
parameters of models from observations (also [Lowe 80]);
prediction with parameterized expressions; a rule-based
matcher.
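
The notions of class restriction and quantifier described for ACRONYM can be illustrated with a minimal sketch. It is not ACRONYM's representation: it assumes that each quantifier is bounded by a simple numeric interval, and the names ObjectClass, restrict, admits and most_specific, as well as the dimensions used, are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from typing import Dict, Optional, Tuple

Interval = Tuple[float, float]  # closed interval (low, high), in meters


@dataclass
class ObjectClass:
    """A generic model: named quantifiers, each bounded by an interval."""
    name: str
    quantifiers: Dict[str, Interval]
    parent: Optional["ObjectClass"] = None

    def restrict(self, name: str, **narrower: Interval) -> "ObjectClass":
        """Create a subclass by tightening some quantifier intervals."""
        merged = dict(self.quantifiers)
        for key, (low, high) in narrower.items():
            base_low, base_high = merged[key]
            # A restriction may only narrow the parent's interval.
            if low < base_low or high > base_high:
                raise ValueError(f"{key} is not a restriction of {self.name}")
            merged[key] = (low, high)
        return ObjectClass(name, merged, parent=self)

    def admits(self, measured: Dict[str, float]) -> bool:
        """True if every measured value satisfies this class's constraints."""
        return all(low <= measured[key] <= high
                   for key, (low, high) in self.quantifiers.items())


def most_specific(measured, generic, restrictions):
    """Name of the first admitting restriction, else the generic class, else None."""
    for cls in restrictions:
        if cls.admits(measured):
            return cls.name
    return generic.name if generic.admits(measured) else None


# Illustrative numbers only (rough lengths and wingspans in meters).
passenger = ObjectClass("passenger aircraft",
                        {"fuselage_length": (30.0, 75.0), "wingspan": (25.0, 65.0)})
l1011 = passenger.restrict("L-1011", fuselage_length=(53.0, 55.0), wingspan=(46.0, 48.0))
b747 = passenger.restrict("747", fuselage_length=(69.0, 72.0), wingspan=(58.0, 61.0))

observed_707 = {"fuselage_length": 46.6, "wingspan": 44.4}
print(most_specific(observed_707, passenger, [l1011, b747]))  # passenger aircraft
```

Under these assumptions a 707-sized measurement is still recognized as a passenger aircraft although no 707 model exists, which is the behavior the text attributes to generic representation.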

ACRONYM's approach in identifying aircraft is to
make predictions about the appearance of aircraft, i.e. to
predict that on a coarse level, fuselage and wings will be
observable. It automatically predicts image features and
relations which are invariant or quasi-invariant over
variations in the model and camera parameters. ACRONYM
predicts shape as ribbons and ellipses. ACRONYM's data are
ribbons found by a ribbon finder [Brooks 79b] based on
output from the Nevatia and Babu edge finder [Nevatia
76]. ACRONYM uses its predictions to analyze ribbons to
select candidates for aircraft. Matches of image features to
model features are represented by constraints. A constraint
satisfiability test removes inconsistent interpretations.
Constraints on image measurements partially determine
model parameters. The determination of model parameters
allows a further prediction of fine detail, for example,
prediction of engine pods and horizontal stabilizers based
on the identification and location of an L-1011. Those
predictions can be tested in order to verify identification
to much higher detail.
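
How a match between an image feature and a model feature can act as a constraint may be sketched as follows. This is an illustration only, not ACRONYM's constraint system: it assumes a pinhole camera looking straight down, interval-valued parameters, a hypothetical focal length, and hypothetical helper names (intersect, constrain_elevation). Only the 1000 to 12000 meter elevation range is taken from the text.

```python
from typing import Optional, Tuple

Interval = Tuple[float, float]  # closed interval (low, high)


def intersect(a: Interval, b: Interval) -> Optional[Interval]:
    """Intersection of two intervals, or None if they are inconsistent."""
    low, high = max(a[0], b[0]), min(a[1], b[1])
    return (low, high) if low <= high else None


def constrain_elevation(image_length: float, focal_length: float,
                        model_length: Interval,
                        elevation: Interval) -> Optional[Interval]:
    """Narrow the camera elevation given one ribbon-to-part match.

    Assumes image_length = focal_length * model_length / elevation
    (pinhole camera looking straight down; all lengths in meters).
    """
    implied = (focal_length * model_length[0] / image_length,
               focal_length * model_length[1] / image_length)
    return intersect(implied, elevation)


FOCAL = 0.15                  # hypothetical 150 mm lens
prior = (1000.0, 12000.0)     # elevation range quoted in the text
wingspan = (46.0, 48.0)       # illustrative model intervals, in meters
fuselage = (53.0, 55.0)

# A 2.0 mm ribbon matched to the wing narrows the elevation interval.
after_wing = constrain_elevation(0.0020, FOCAL, wingspan, prior)
# A 2.32 mm ribbon matched to the fuselage narrows it further.
after_both = constrain_elevation(0.00232, FOCAL, fuselage, after_wing)
# A 0.5 mm ribbon cannot be the wing at any allowed elevation.
rejected = constrain_elevation(0.0005, FOCAL, wingspan, prior)
print(after_wing, after_both, rejected)
```

Each accepted match tightens the admissible parameter range, and an empty intersection plays the role of the satisfiability test that discards an inconsistent interpretation.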


Image  Description 

Performance of ACRONYM and of stereo programs
both depend on the quality of symbolic descriptions of
images as curves and ribbons. In the near future,
significantly improved curve descriptions are expected.
[Binford 81] describes new theoretical results on
describing image boundaries, based on extension of the
Binford-Horn line finder [Horn 72]. These results have
not been implemented yet. An argument is made that
coarse-to-fine operation is not useful; that fine-to-coarse
is preferable because the output of fine operators can be
used to eliminate the effect of small detail from coarse
operators in a way that simple weighted averages
(frequency filtering) cannot do. A conjecture was made
about human performance on discriminating
darker-vs-lighter for two background areas with readily
discriminable fine texture. The contrast between the
backgrounds is chosen to be small so that the
discrimination is possible only over large areas. The
contrast and density of small features are chosen so that
their average intensity reverses the contrast between the
two regions. The apparent result from a casual experiment
is that humans detect the difference between backgrounds
when the small features are discernible, and perceive the
difference in the opposite sense when the small features
are not discernible. Allan Miller and Craig Hublee
synthesized the test images. The conclusion from this
experiment is that position-invariant operators cannot be
used at the coarse level to explain human performance,
and should not be used by machines. The operators
introduced in [Marr 77] are thus not adequate to explain
human performance. That paper describes a solution to
directional boundary operators.
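
The contrast reversal at the heart of this experiment can be reproduced numerically. The sketch below is not the synthesized test imagery: the intensities, the feature density, and the use of a deviation-from-median test as a stand-in for a fine operator are all assumptions made for illustration.

```python
from statistics import mean, median


def coarse_mean(region):
    """Position-invariant coarse operator: a plain average over the region."""
    return mean(region)


def background_mean(region, threshold=20.0):
    """Fine-to-coarse: drop pixels flagged as small features, then average.

    The median-deviation test stands in for the output of a fine operator.
    """
    typical = median(region)
    kept = [v for v in region if abs(v - typical) <= threshold]
    return mean(kept)


# Synthetic 1-D regions (assumed values): the background of B is slightly
# lighter than that of A, but B carries sparse dark features.
region_a = [100.0] * 400
region_b = [0.0 if i % 20 == 0 else 102.0 for i in range(400)]

coarse_difference = coarse_mean(region_b) - coarse_mean(region_a)
fine_to_coarse_difference = background_mean(region_b) - background_mean(region_a)
print(coarse_difference)          # negative: B looks darker to a plain average
print(fine_to_coarse_difference)  # positive: B's background is lighter
```

The plain average reports the opposite sign from the background comparison, while removing the detected small features before averaging recovers the true ordering.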

[MacVicar-Whelan 81] describes a boundary finder
with sub-pixel accuracy. This is intended as a processor
with intermediate performance requiring little
computation. It is based on an approach similar to
Binford-Horn [Horn 72] and [Marr 79].

[Miller 81] describes a VLSI implementation of a
vision  processor.  This  is  an  initial  step  toward  high 
performance  computation  of  image  features  such  as 
boundaries. 

Recent results suggest renewed support for the
following design of a vision system: determination of
image features with operators over a range of sizes; the
output of each size stage is used to remove boundaries for
the image data input to the next coarser stage; geometric
grouping operations are performed on outputs of each
stage, including the finest; relatively general, local
assumptions are used to infer edges of surfaces from image
boundaries. This model was unpopular in 1970. It appears
a stronger candidate now because boundary operators and
grouping operators use similar mechanisms which appear
promising for implementation in VLSI, and because of the
introduction of a general approach to interpretation of
image boundaries [Binford 81].

Stereo  Compilation 

The chief problem of stereo vision is to find
correspondences between areas and edges in one view, and
those in another view. There is only partial
correspondence between the two views, which differ
because of geometric differences, photometric differences,
and physical changes from one photo to the next. Consider
geometric image correspondence, the map between two
images, which involves stretching, folding, and cutting,
corresponding to surface inclination, discontinuities in
tangent plane, and occlusion.

Much work has been directed toward correspondence
along epipolar lines in an image [Henderson 79]. Arnold
and Binford show that there is a tight constraint on
corresponding edges [Arnold 80]. For edges randomly
distributed in direction, most edges will appear to have
similar angles in two views. This constraint is restrictive
for typical mapping sequences of aerial photos; it is very
tight for human stereo. Half width at half maximum for
true matches is 9 degrees for human stereo at 1 foot. These
results have consequences for biological stereo vision
which are consistent with experiment. [Nelson 77] reports
half width at half maximum of 10-20 degrees for the cat.
If randomly distributed edges are mismatched, i.e.
different views correspond to different edges, they
correspond to a population of edges peaked along the lines
of sight to the two cameras.
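
The qualitative content of the edge angle constraint can be checked with a small Monte Carlo experiment. The sketch is not the analysis of [Arnold 80] and does not attempt to reproduce the 9 degree figure: it assumes orthographic views rotated about the vertical axis, edge directions uniform on the sphere, and hypothetical vergence values (about 6.5 cm eye separation at 1 foot for the narrow case).

```python
import math
import random


def random_direction(rng):
    """Unit vector uniformly distributed on the sphere (normalized Gaussian)."""
    while True:
        x, y, z = rng.gauss(0, 1), rng.gauss(0, 1), rng.gauss(0, 1)
        norm = math.sqrt(x * x + y * y + z * z)
        if norm > 1e-9:
            return x / norm, y / norm, z / norm


def image_angle(direction, view_rotation):
    """Angle of an edge in an orthographic view rotated about the vertical axis.

    The image x axis lies along the epipolar lines.
    """
    dx, dy, dz = direction
    x_img = dx * math.cos(view_rotation) + dz * math.sin(view_rotation)
    return math.atan2(dy, x_img)


def angle_difference(direction, vergence):
    """Unsigned difference (radians, 0..pi/2) between edge angles in two views."""
    a_left = image_angle(direction, +vergence / 2)
    a_right = image_angle(direction, -vergence / 2)
    diff = abs(a_left - a_right) % math.pi   # edges are undirected
    return min(diff, math.pi - diff)


def fraction_within(vergence, limit_deg=10.0, samples=20000, seed=0):
    """Fraction of random edges whose two image angles agree within limit_deg."""
    rng = random.Random(seed)
    limit = math.radians(limit_deg)
    hits = sum(angle_difference(random_direction(rng), vergence) <= limit
               for _ in range(samples))
    return hits / samples


narrow = 2 * math.atan(0.0325 / 0.3048)  # assumed human geometry at 1 foot
wide = 1.0                                # assumed wide-baseline pair, radians
frac_narrow = fraction_within(narrow)
frac_wide = fraction_within(wide)
print(frac_narrow, frac_wide)
```

In this simplified model most randomly oriented edges keep nearly the same image angle for the narrow vergence, and the constraint loosens as the baseline widens, in line with the statement that it is very tight for human stereo.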

The constraint on edge angles relates isolated edges
without relations between edges. The next level of
constraint is that of surfaces. An area in an image
corresponds to a surface in space. Consider the interval
between pairs of edges along an epipolar line,
corresponding to a slice through the surface. For surfaces
randomly oriented in space, projected intervals are nearly
equal for most surfaces. Randomly mismatched intervals
correspond to surfaces which are peaked along the lines of
sight from the two viewpoints. Again, the constraint is
useful for mapping sequences with wide stereo baseline,
and very tight for human stereo which is small angle
stereo.
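
The interval constraint admits a similar back-of-envelope check. Under the same assumed orthographic two-view geometry, a slice of a surface lying in the epipolar plane and inclined by an angle beta to the frontal plane projects with length proportional to cos(beta - phi) in a view rotated by phi. The function names below are hypothetical.

```python
import math


def projected_interval(length, inclination, view_rotation):
    """Length along the epipolar line of a surface slice, orthographic view."""
    return length * abs(math.cos(inclination - view_rotation))


def interval_ratio(inclination, vergence):
    """Ratio (>= 1) of the longer to the shorter projected interval."""
    left = projected_interval(1.0, inclination, +vergence / 2)
    right = projected_interval(1.0, inclination, -vergence / 2)
    shorter, longer = min(left, right), max(left, right)
    return longer / shorter if shorter > 0 else math.inf


vergence = 0.21  # radians; assumed small-angle (human-like) stereo
for degrees in (0, 45, 80):
    print(degrees, round(interval_ratio(math.radians(degrees), vergence), 3))
```

The ratio stays close to one until the slice is nearly aligned with a line of sight, which is why correctly matched intervals are nearly equal for most surface orientations when the stereo angle is small.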

For modeling expected edge and surface distributions,
the model of a uniform distribution on the unit sphere
(Gaussian sphere) is modified by superimposing a peak
along the vertical and a band along the horizontal plane
for cultural objects and the ground surface. Vertical edges
in aerial photographs typically appear short; measurement
of their direction is feasible with sophisticated edge
operators.

[Clarkson 81] has found a version of the solution to
the camera transform given only pairs of conjugate image
points without spatial location. The method involves
solution of a three parameter problem instead of a five
parameter problem; thus it presumably requires much less
computation. The solution must be within about .3
radians in initial estimates of orientation parameters. The
objective of this work was to find a solution which is
more robust than that of Gennery given noise points
[Gennery 79]. This goal has not been achieved yet.

Interpretation 

New theoretical results outline a theory of
interpretation of line drawings which promises to include
curved surfaces with surface markings and shadows,
paper (non-solids), and wires [Binford 81]. These results
appear to have definite application for stereo and
photointerpretation. There appears to be a rich body of
results to be obtained by extending that line of research;
in particular, it is planned to extend the results to stereo
coincidence. The approach begins by characterizing
surfaces by their edges and limbs, i.e. apparent edges. It
uses assumptions of general source position and general
observer position to identify evidence that certain edges
and surfaces intersect in space, and whether surfaces are
smooth at apparent edges. Where curves are smooth at
junctions, assume that the surface is smooth. Where
curves have breaks, assume that the surface has a crease,
i.e. discontinuous tangent plane. In absence of other
information, if curves appear to intersect in an image, and
if their inverse image is sufficiently constrained, assume
that they are the images of curves which intersect in
space. The inverse image of a curve in an image is the set
of rays in space (a surface) which project onto the curve.
If surface markings are drawings of objects, they will
fool these interpretations, but where shadows and surface
markings aren't constructed perversely, they give no
indication of solid objects. When shadows and surface
markings cross true edges, the true edge is unaffected; the
image of the edge is smooth. Thus the system infers that
the surface is smooth along the shadow or surface mark.
These results are much more general than previous
theoretical studies of interpretation of line drawings
[Guzman 68, Huffman 71, Clowes 71, Waltz 72,
Mackworth 73, Turner 74].

Where shadow information is available, it enables
accurate measurement of three-space positions of many
edges in the scene, from which other dimensions can be
inferred. Shadows provide information equivalent to
stereo. Use of shadow information is an important
capability for geometric reconstruction for
photointerpretation. Use of shadows will be discussed
below.

[Lowe 81] exploits some of these assumptions in
interpreting an aerial photograph of an aircraft. Figure 1a
shows image curves extracted by hand from output of the
Nevatia and Babu line finder [Nevatia 76]. This is a cheat
which represents a judgment of what line finders will
produce in a few months. Figure 1b shows occlusion cues
which provide a layered relative depth model: the
fuselage is above the wing which is above the engine pod
which is above the shadow which is on the ground; the
shadow is over a white line which must be on a smooth
surface. The techniques are quite general; they do not
depend at all on knowing the objects in question, and they
use cues which are universally available. Figure 1c shows
matching of vertices of shadow images coincident with
other junctions along a projection from the sun
(coincidence assumption). The system determines heights
of edges which cast shadows from triangulation with the
known sun position. Figure 1d shows the resulting
three-space model of the surfaces of the aircraft seen from
several views.
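
The triangulation of heights from shadows can be sketched for the simplest geometry. This is not the procedure of [Lowe 81]: it assumes a vertical, map-like view with ground coordinates in meters, a level surface receiving the shadow, a distant sun, and an azimuth convention (angle of the horizontal direction toward the sun, counterclockwise from the x axis) that is ours. The function names are hypothetical.

```python
import math


def along_sun_projection(edge_xy, shadow_xy, sun_azimuth_deg, tolerance_deg=5.0):
    """True if the shadow vertex lies down-sun of the edge vertex.

    This mirrors the coincidence test: a shadow vertex is paired with a
    junction only if it lies along the projection from the sun.
    """
    dx, dy = shadow_xy[0] - edge_xy[0], shadow_xy[1] - edge_xy[1]
    direction = math.degrees(math.atan2(dy, dx))
    expected = sun_azimuth_deg + 180.0       # shadows point away from the sun
    error = (direction - expected + 180.0) % 360.0 - 180.0
    return abs(error) <= tolerance_deg


def height_from_shadow(edge_xy, shadow_xy, sun_elevation_deg):
    """Height of an edge point above the surface its shadow falls on."""
    shadow_length = math.dist(edge_xy, shadow_xy)
    return shadow_length * math.tan(math.radians(sun_elevation_deg))


wing_tip = (10.0, 5.0)        # hypothetical image-derived ground positions
tip_shadow = (13.0, 9.0)      # 5 m away from the wing tip
print(along_sun_projection(wing_tip, tip_shadow, sun_azimuth_deg=233.13))
print(height_from_shadow(wing_tip, tip_shadow, sun_elevation_deg=45.0))
```

With the sun at 45 degrees elevation the height equals the shadow displacement, so the 5 m offset in this example corresponds to a wing tip about 5 m above the ground.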

There is much more still to be extracted from these curve data.

[Liebes  81]  has  developed  a  general  formulation  of 
the  principles  of  the  perspective  geometry  of  shadow 
formation  for  local  and  remote  point  and  extended 
sources,  and  has  applied  this  formulation  to  a  variety  of 
basic  geometrical  configurations. 

It is proposed to follow this
investigation of shadows to some further results which
appear to be directly ahead.

These monocular interpretations are not a substitute
for stereo, but they can aid greatly in stereo
correspondence. One problem with stereo systems has been
that they don't use shape as humans do and they don't use
context. There are at least three classes of monocular cues
described above: first, depth ordering of surfaces from
occlusion cues, combined with object segmentation;
second, shape description in terms of generalized
cylinders; and third, quantitative depth models obtained
from analysis of shadows. Correspondence is enormously
simplified if there is a qualitative ordering of surfaces,
and simplified more still if there is a quantitative model.
That is, matching top surfaces with top surfaces, and
ground surfaces with ground surfaces leaves small search
spaces.

Orthogonal  Trihedral  Vertices 

[Liebes 81] has achieved a number of preliminary
results on the use of geometric constraints for special
surfaces in stereo matching for orthogonal trihedral
vertices. These provide monocular cues for
correspondence.

The objective of determining a depth map z(x,y) can
be aided by symbolic description of portions of the scene.
In some portions of an image, little or no depth
information can be obtained, e.g. on water, snow,
concrete, gravel roofs, on uniform metal surfaces, etc. The
best information about surface height in such cases can be
obtained by fitting a surface satisfying boundary
conditions with symbolic constraints. A lake is an extreme
case. The condition applies to the height of the water at
the lake boundaries with the constraint that the water
surface is horizontal. On a horizontal circular cylinder,
the surface has boundary conditions that the position at
the boundaries of visible surface is known and the
surface tangent is known to be along the line of sight.
Thus, symbolic constraints provide ways of translating
familiarity cues into improved measurements.
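
The lake example can be sketched in a few lines of Python. This is a minimal illustration, not the authors' procedure: it assumes heights have already been measured at shoreline pixels, and the names fit_horizontal_level and fill_region are hypothetical. Under the constraint that the surface is horizontal, the only unknown is a single height, and its least-squares estimate is the mean of the boundary heights.

```python
def fit_horizontal_level(boundary_heights):
    """Least-squares height of a horizontal surface given boundary samples.

    Entries that are None stand for shoreline pixels with no stereo match.
    """
    heights = [h for h in boundary_heights if h is not None]
    if not heights:
        raise ValueError("no measured boundary heights")
    return sum(heights) / len(heights)


def fill_region(depth, region, level):
    """Return a copy of the depth map with every (row, col) in region set to level."""
    filled = [row[:] for row in depth]
    for r, c in region:
        filled[r][c] = level
    return filled


# Hypothetical 3x3 depth map: the centre pixel is water with no depth information.
depth = [[1.0, 1.2, 1.0],
         [0.8, None, 1.0],
         [1.0, 1.0, 1.0]]
shoreline = [depth[r][c] for r in range(3) for c in range(3) if (r, c) != (1, 1)]
lake_level = fit_horizontal_level(shoreline)
filled_depth = fill_region(depth, [(1, 1)], lake_level)
```

The cylinder case would need a different surface family (a circular cross section with tangency along the line of sight), but the pattern is the same: the symbolic label selects the surface model and the boundary measurements fix its parameters.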

The direction of the gravity vector is perhaps the
single  most  important  determinant  in  the  alignment  of 
architectural  structures.  Walls  of  buildings  and  their 
edges  tend  to  be  vertical.  Most  of  the  visible  area  in  an 
aerial  photograph  of  a  building  is  the  roof.  Roofs  are 
usually  composed  of  planar  sections,  many  horizontal. 
Architectural  structures  contain  many  right  angles  and 
orthogonal  trihedral  vertices,  at  intersections  of  walls  and 
roofs of concrete buildings, at doorways, windows, and
the  like.  These  elements  are  usually  aligned  with  gravity. 
Pipe  lines  and  large  cylindrical  structures  such  as  storage 
tanks  tend  to  have  their  axes  either  horizontally  or 
vertically  aligned. 

Vertical and horizontal surfaces and edges occur with
such frequency in cultural objects that it is valuable to
work out special case constraints for these construction
elements. We intend to address the important special cases
of plane and cylindrical structural elements, especially
right parallelepipeds and right circular cylinders aligned
with gravity, vertical and horizontal surfaces. To ignore
these capabilities for use of special structures is to throw
away valuable information. Liebes is working to quantify
these special structure constraints.

Consider the case of orthogonal trihedral vertices
(OTVs), which appear as internal and external corners of
right parallelepipeds. The approach is based upon the use
of projective invariants. Images of OTVs show projective
distortion depending upon their orientation and range
relative to the camera. The study of projective
transformations and stereoscopic imagery has yielded
valuable formulations involving projective invariants,
coordinate representations, and stereo edge element
organization and analysis. These formulations have been
applied to projective invariants of OTVs, using the
locations of the vanishing points for their edges, or
equivalently their surface normals. In an application to
the important special case of nadir-oriented stereo
cameras, a simple set of projection and visibility rules
has been found that uniquely labels the corners. Given an
OTV in one image, the rules specify the quantitative
appearance of the corresponding OTV in the conjugate
image, as a function of relative displacement along the
associated epipolar line. The nadir viewing case extends
directly to oblique viewing configurations.

The  simplicity  of  the  rule  formulation  in  nadir 
viewing  arises  from  the  facts  that  the  vertical  edge 
vanishing  point  coincides  with  the  nadir  point,  and  both 
of  the  remaining  OTV  edge  vanishing  points  are  oriented 
at right angles to one another at infinite distance parallel
to  the  film  plane.  In  the  more  general  oblique  case,  all 
three  vanishing  points  are  at  finite  distance  from  one 
another  in  the  film  plane.  The  rules  in  the  latter 
circumstance  more  explicitly  utilize  the  projective 
relationship  of  the  edges  of  the  sixteen  different  kinds  of 
corners  to  the  vanishing  points.  Liebes  has  demonstrated 
that elements of the approach extend to the case of
vertical  cylinders  with  arbitrary  polygonal  cross  section. 
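
The vanishing-point geometry described above can be checked numerically. The sketch below is not the rule set of [Liebes 81]; it only illustrates the standard pinhole relation the rules build on. It assumes a camera looking along +z with focal length F and the principal point at the image origin, and all function names are hypothetical. For two perpendicular edge directions with finite vanishing points v1 and v2, the relation v1 . v2 + F^2 = 0 holds.

```python
import math


def vanishing_point(direction, f):
    """Image of the point at infinity along a 3-D edge direction.

    Returns None when the edge is parallel to the film plane, i.e. when the
    vanishing point lies at infinite distance in the image.
    """
    dx, dy, dz = direction
    if abs(dz) < 1e-12:
        return None
    return (f * dx / dz, f * dy / dz)


def rotate_x(v, angle):
    """Rotate a 3-vector about the camera x axis."""
    x, y, z = v
    c, s = math.cos(angle), math.sin(angle)
    return (x, c * y - s * z, s * y + c * z)


def rotate_y(v, angle):
    """Rotate a 3-vector about the camera y axis."""
    x, y, z = v
    c, s = math.cos(angle), math.sin(angle)
    return (c * x + s * z, y, -s * x + c * z)


def orthogonality_residual(v1, v2, f):
    """v1 . v2 + f^2, which vanishes for perpendicular edge directions."""
    return v1[0] * v2[0] + v1[1] * v2[1] + f * f


F = 50.0  # assumed focal length, in image units
OTV_EDGES = [(1.0, 0.0, 0.0), (0.0, 1.0, 0.0), (0.0, 0.0, 1.0)]

# Nadir view: the vertical edge lies along the optical axis, so its vanishing
# point is the principal (nadir) point; both horizontal edges vanish at infinity.
nadir_vps = [vanishing_point(e, F) for e in OTV_EDGES]

# Oblique view: tilt the corner so that all three vanishing points are finite.
oblique_edges = [rotate_y(rotate_x(e, 0.5), 0.4) for e in OTV_EDGES]
oblique_vps = [vanishing_point(e, F) for e in oblique_edges]
```

In the oblique case each pair of vanishing points satisfies the orthogonality relation, which is one way a candidate corner in one image could be tested for consistency with an OTV interpretation.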

References 

[Arnold 80] Arnold, R.D., Binford, T.O.; "Geometric
Constraints in Stereo Vision"; Proc SPIE, San Diego, Cal,
July 1980.

[Binford 81] T.O. Binford; "Inferring Surfaces from
Images"; Artificial Intelligence Journal, forthcoming,
1981.

[Brooks 79] Brooks, R.A., Greiner, R., Binford, T.O.;
"ACRONYM: A Model-Based Vision System"; Proc Int Jt
Conf on AI, Aug 1979.

[Brooks 79b] Brooks, R.A.; "Goal-Directed Edge Linking
and Ribbon Finding"; Proc Image Understanding
Workshop, Palo Alto, Calif, Apr 1979.

[Brooks 80] Brooks, R.A., Binford, T.O.; "Interpretive
Vision and Restriction Graphs"; Proc First National AAAI
Conference, Stanford, Calif, August 1980.

[Brooks  81]  R.A.  Brooks;  "Model-Based 
Three-Dimensional  Interpretations  of  Two-Dimensional 
Images";  Proc  Image  Understanding  Workshop,  April 
1981. 

[Clarkson 81] K.L. Clarkson; "A Procedure for Camera
Calibration"; Proc Image Understanding Workshop, April
1981.

[Clowes 71] Clowes, M.B.; "On Seeing Things"; AI
Journal, 1971.

[Gennery 79b] Gennery, D.B.; "Stereo Camera Solver";
Proc Image Understanding Workshop, Nov 1979, USC, Los
Angeles.

[Henderson 79] H.L. Henderson, W.J. Miller, C.B. Grosch;
"Automatic Stereo Reconstruction of Man-Made Targets";
SPIE Proc. Huntsville, Aug 1979.

[Horn 72] B.K.P. Horn; "The Binford-Horn Edge Finder";
MIT AI Memo 285, 1972, revised December 1973.

[Huffman 71] Huffman, D.A.; "Impossible objects as
nonsense sentences"; Machine Intelligence 6, 1971.


[Liebes 81] S. Liebes, Jr.; "Geometric Constraints for
Interpreting Images of Common Structural Elements:
Orthogonal Trihedral Vertices"; Proc Image
Understanding Workshop, April 1981.

[Lowe 80] Lowe, D.G.; "Solving for the Parameters of
Object Models from Image Descriptions"; Proc Image
Understanding Workshop, Univ of Md, April 1980.

[Lowe 81] D. Lowe, T.O. Binford; "The Interpretation of
Geometric Structure from Image Boundaries"; Proc Image
Understanding Workshop, April 1981.

[Mackworth 73] Mackworth, A.K.; "Interpreting
Pictures of Polyhedral Scenes"; AI Journal 4, (1973).

[MacVicar-Whelan 81] P. MacVicar-Whelan, T.O. Binford;
"Line-finding to sub-Pixel Precision"; Proc Image
Understanding Workshop, April 1981.

[Marr 77] D. Marr, T. Poggio; "A Theory of Human Stereo
Vision"; AI Memo 451, MIT, Nov 1977.

[Marr 79] D. Marr, E. Hildreth; "Theory of Edge
Detection"; AI Memo 513, AI Lab MIT, April 1979.

[Miller 81] A. Miller, M. Lowry; "General purpose VLSI
chip with fault tolerant hardware for image
processing"; Proc Image Understanding Workshop, April
1981.

[Nelson 77] J.I. Nelson, H. Kato, & P.O. Bishop;
"Discrimination of orientation and position disparities
by binocularly activated neurons in cat striate cortex"; J
Neurophysiology, 40(2):260-283, 1977.

[Nevatia 78] Nevatia, R., K.R. Babu; "Linear Feature
Extraction"; Proc. ARPA Image Understanding Workshop,
Pittsburgh, Nov. 1978, 73-78.

[Turner 74] Turner, K.; "Computer Perception of Curved
Objects using a Television Camera"; PhD, Univ of
Edinburgh, 1974.

[Ullman 79] S. Ullman; "The interpretation of structure
from motion"; MIT-AI Memo 476, 1978.

[Waltz 72] Waltz, D.; "Generating Semantic Descriptions
from Drawings of Scenes with Shadows"; MIT-AI Tech
Rept AI-TR-271, 1972; also "Understanding Line
Drawings of Scenes with Shadows"; P.H. Winston, ed;
McGraw-Hill, 1975.


PROGRESS  IN  THE  USC  IMAGE  UNDERSTANDING  PROGRAM 


R. Nevatia and A. A. Sawchuk
University of Southern California
Los  Angeles,  California  90007 


Our  goals  have  been  to  develop  techniques 
that  have  wide  utility  and  to  test  them  on 
problems  of  image  to  map  correspondence  and  scene 
matching.  This  paper  is  a  summary  of  our  program 
for the last year. Details of these projects may
be found in [1-2].

SCENE  MATCHING 

We have continued to work on the problems of
matching  two  images  of  a  scene,  or  one  image  with 
a  map,  by  generating  symbolic  descriptions  from 
each.  We  have  implemented  matching  techniques 
using  search  as  well  as  relaxation  techniques. 
New  results  using  the  latter  are  described  in  an 
accompanying  paper  [3].  We  feel  that  our  system 
can  handle  complex  aerial  images,  the  major 
limitations  being  due  to  failures  of  the 
segmentation  procedures.  We  have  initiated  work 
to  use  the  map  as  a  guide  to  such  segmentation. 

SYMBOLIC  TEXTURE  ANALYSIS 

We  have  a  system  in  highly  developed  form  for 
describing  natural  textures  in  terms  of  their 
primitives and relations among them. The
descriptions  consist  of  the  primitive  sizes  and 
repetition  patterns  if  any.  A  statistical 
analysis  of  our  technique  agrees  with  the 
experimental  results.  The  descriptions  have  been 
tested  for  texture  identification  and  we  are 
investigating  their  use  to  estimate  surface 
orientation from texture gradients. This
technique is described in detail in another
accompanying paper [4].

TEXTURE  SYNTHESIS  AND  ANALYSIS 

We  have  been  working  on  several  different 
statistical  techniques  for  synthesizing  natural 
textures.  Both  gray  level  and  binary  textures  can 
be  synthesized,  and  ten  distinct  techniques  with 
various  tradeoffs  have  been  explored.  The 
tradeoff  parameters  include  such  factors  as 
computation  time  for  generation,  computation  time 
for  data  collection,  memory  requirements,  and 
quality of simulation. Many commonly occurring
natural textures have been adequately simulated
using very simple models, providing potentially
great information compression for many
applications. Other textures with macrostructure
and nonstationary characteristics require more
extensive computation to synthesize realistic,
visually pleasing results. Although the success
of any synthesis method is highly dependent on the
texture itself and the modeling scheme chosen,
general guidelines for predicting the performance
of various techniques have been developed.

We  also  hope  to  use  these  techniques  for 
texture  classification  and  image  segmentation. 
Some  preliminary  experiments  employing  statistical 
feature  selection  and  classification  techniques 
for  discrimination  have  been  undertaken  by  this 
approach.  Another  paper  included  in  these 
proceedings  [5]  describes  the  texture  synthesis 
results  in  more  detail.  Additional  detail  is 
contained  in  a  forthcoming  semi-annual  technical 
report  [2]. 

OTHER  PROJECTS 

We  are  continuing  to  make  improvements  to  our 
road  finding  and  linear  feature  extraction 
programs.  We  have  incorporated  Laplacian-Gaussian 
masks,  suggested  by  Marr,  as  an  alternative  for 
low  level  edge  extraction.  We  have  also 
implemented  techniques  of  connecting  segments  to 
give  more  continuous  boundaries. 
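
A Laplacian-Gaussian mask of the kind mentioned above can be generated directly from its closed form. The following is a minimal sketch under stated assumptions: the value of sigma, the truncation radius of three sigma, and the mean subtraction that forces the mask to sum to zero are illustrative choices, not parameters reported for the USC programs. Edges are then located at zero crossings of the filtered image.

```python
import math


def log_kernel(sigma, radius=None):
    """Sampled Laplacian-of-Gaussian mask, adjusted to sum to zero."""
    if radius is None:
        radius = int(math.ceil(3 * sigma))
    size = 2 * radius + 1
    kernel = []
    for i in range(size):
        row = []
        for j in range(size):
            r2 = (i - radius) ** 2 + (j - radius) ** 2
            # Laplacian of exp(-r^2 / (2 sigma^2)), up to a positive constant.
            row.append((r2 - 2 * sigma ** 2) / sigma ** 4
                       * math.exp(-r2 / (2 * sigma ** 2)))
        kernel.append(row)
    mean = sum(map(sum, kernel)) / (size * size)
    return [[v - mean for v in row] for row in kernel]


def convolve_valid(image, kernel):
    """2-D convolution over the positions where the mask fits entirely."""
    kh, kw = len(kernel), len(kernel[0])
    ih, iw = len(image), len(image[0])
    out = []
    for r in range(ih - kh + 1):
        row = []
        for c in range(iw - kw + 1):
            acc = 0.0
            for i in range(kh):
                for j in range(kw):
                    acc += image[r + i][c + j] * kernel[kh - 1 - i][kw - 1 - j]
            row.append(acc)
        out.append(row)
    return out


# A vertical step edge between columns 5 and 6 of a 7 x 12 test image.
step_image = [[0.0] * 6 + [10.0] * 6 for _ in range(7)]
mask = log_kernel(1.0)
response = convolve_valid(step_image, mask)[0]
```

The response is positive on the dark side of the step and negative on the bright side, so the sign change between adjacent outputs marks the edge position.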

We  are  also  investigating  the  use  of 
hierarchical  gradient  relaxation  techniques  for 
matching of 2-D and 3-D shapes. We hope to
present  results  in  a  later  paper. 

HARDWARE  IMPLEMENTATION 

In  continuing  work  with  Hughes  Research 
Laboratories,  Malibu,  California,  we  are  exploring 
architecture  and  hardware  issues  in  the 
implementation  of  image  understanding  algorithms 
by VLSI techniques. The initial part of the study
has concentrated on three algorithms: a)
Nevatia-Babu Line Finder [6]; b) Ohlander-Price
Region Segmentor [7]; and c) Laws Texture Analysis
System [8]. These three algorithms are all very
computation intensive and have a broad range of
applications in image understanding research.
Common to algorithms a) and c) is extensive
two-dimensional convolutional processing,
especially in the early stages of the algorithm.
This convolutional processing is largely local and
is well matched to the nature of VLSI systems, in
which interconnections are difficult to implement.

More  recently,  the  convolution  problem  has 
led  to  a  detailed  design  for  a  5x5  pixel 
convolution processor based on residue arithmetic.
The system is called RADIUS (Residue Arithmetic
Digital  Image  Understanding  System).  The  residue 
arithmetic  approach  has  the  advantages  of 
modularity,  ease  of  design,  programmability,  broad 
application  to  many  problems,  and  ease  of 
implementation  in  many  integrated  circuit 
technologies with submicron structures. Using
residue  arithmetic,  there  are  no  carries  in  the 
numerical  computation  and  minimal  interconnections 
on  the  chip  are  required.  Very  high  speed 
processing  is  possible  because  many  of  the 
numerical  operations  reduce  to  table  lookups  in 
binary  digital  RAM's.  Special  purpose  integrated 
circuits  to  perform  part  of  the  processing  are 
being  fabricated,  and  the  construction  of  an 
experimental  system  is  in  progress.  In  addition 
to  the  increased  computational  speed  for 
convolution,  the  processor  has  applications  in 
evaluation  of  polynomial  functions,  integer 
coefficient  transforms,  enhancement  operations, 
and moment calculations. Additional details are
contained in a paper in these proceedings [9].
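
The carry-free property of residue arithmetic can be illustrated with a small software model. This is a sketch only: the moduli below are a hypothetical choice and are not the ones used in RADIUS, and the hardware would replace each per-modulus operation with a table lookup. Each channel computes its multiply-accumulate independently, and the Chinese remainder theorem recovers the integer result at the end.

```python
# Hypothetical pairwise coprime moduli; their product bounds the dynamic range.
MODULI = (16, 15, 13, 11, 7)
M = 1
for _m in MODULI:
    M *= _m  # 240240


def to_residues(x):
    """Represent an integer by its remainders modulo each channel."""
    return tuple(x % m for m in MODULI)


def multiply_accumulate(acc, a, b):
    """acc + a * b, computed channel by channel with no carries between channels."""
    return tuple((r + x * y) % m for r, x, y, m in zip(acc, a, b, MODULI))


def from_residues(residues):
    """Chinese remainder reconstruction, mapped to a signed range around zero."""
    total = 0
    for r, m in zip(residues, MODULI):
        partial = M // m
        total += r * partial * pow(partial, -1, m)  # modular inverse (Python 3.8+)
    total %= M
    return total - M if total > M // 2 else total


def convolve_at(window, weights):
    """One output pixel of a 5x5 convolution evaluated in residue form."""
    acc = to_residues(0)
    for pixel_row, weight_row in zip(window, weights):
        for pixel, weight in zip(pixel_row, weight_row):
            acc = multiply_accumulate(acc, to_residues(pixel), to_residues(weight))
    return from_residues(acc)


# Example: 8-bit pixels and a 5x5 mask with small signed integer weights.
window = [[(7 * r + 13 * c) % 256 for c in range(5)] for r in range(5)]
weights = [[-1] * 5 for _ in range(5)]
weights[2][2] = 24
residue_result = convolve_at(window, weights)
direct_result = sum(p * k for pr, kr in zip(window, weights) for p, k in zip(pr, kr))
```

The result is exact as long as the true sum stays within half the product of the moduli in magnitude, which is why the choice of moduli sets the usable pixel and coefficient word lengths.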

1. Technical Contributions

1.1.  Parameter  Networks  and  the  Hough 
Transform 

One of the most difficult problems in
vision is segmentation. Recent work has
shown how to calculate intrinsic images
(e.g., optical flow, surface orientation,
occluding contour, and disparity). These
images are distinctly easier to segment
than the original intensity images.

The  Hough  transform  idea  has  been 
developed  into  a  general  control 
technique.  Intrinsic  image  points  are 
mapped  (many  to  one)  into  "parameter 
networks" [Ballard, 1980]. This theory
explains segmentation in terms of highly
parallel cooperative computation among
intrinsic images and a set of parameter
spaces at different levels of abstraction.

Our  work  with  two-dimensional  Hough 
transform  techniques  has  been  reported 
previously [Ballard, 1979; Sloan and
Ballard, 1980]. The most recent
application of these ideas is to the
problem of detecting the presence and
orientation of rigid, three-dimensional
objects [Ballard, 1981; Ballard and
Sabbah, 1981].
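
For readers unfamiliar with the basic voting scheme, the classical two-dimensional Hough transform for straight lines can be sketched as follows. This is the textbook form, not the parameter-network formulation of [Ballard, 1980]; the accumulator resolution and the function names are assumptions made for illustration. Each image point votes for every (theta, rho) cell consistent with it, so many collinear points map to one parameter cell.

```python
import math
from collections import Counter


def hough_lines(points, n_theta=180, rho_step=1.0):
    """Accumulate votes for lines x*cos(theta) + y*sin(theta) = rho."""
    votes = Counter()
    for x, y in points:
        for t in range(n_theta):
            theta = math.pi * t / n_theta
            rho = x * math.cos(theta) + y * math.sin(theta)
            votes[(t, round(rho / rho_step))] += 1
    return votes


def strongest_line(points, n_theta=180, rho_step=1.0):
    """Return (theta, rho, vote count) of the best supported accumulator cell."""
    votes = hough_lines(points, n_theta, rho_step)
    (t, r), count = votes.most_common(1)[0]
    return math.pi * t / n_theta, r * rho_step, count


# Ten points on the vertical line x = 5, plus two unrelated points.
edge_points = [(5, y) for y in range(0, 100, 10)] + [(1, 2), (30, 7)]
theta, rho, support = strongest_line(edge_points)
```

The accumulator here is a sparse dictionary rather than a fixed array, which is convenient in software but sidesteps the storage problem that motivates the dynamically quantized spaces discussed next.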

Hough-like  techniques  involving 
high-dimensional  transform  spaces  have 
prompted  a  need  for  Dynamically  Quantized 
Spaces.  We  have  recently  developed  a  data 
structure,  based  on  the  pyramid,  which  can 
cover  a  parameter  space  with  a  limited 
number  of  accumulators  in  such  a  way  that 
fine precision is maintained where it is
needed.  This  data  structure  has  the 
advantage  that  its  resource  allocation  and 
connections  are  fixed.  It  differs  from 
the  usual  pyramid  in  that  the  boundaries 
of  elements  in  the  pyramid  are  continually 
modified  by  a  hierarchical  warping 
process.  Essentially,  each  cell  tries  to 
track  the  mean  position  of  votes  in  its 
part of the space. This estimate of the
local mean is used to define the
boundaries of the cell's position [Sloan,
1981].


1.2. Computing with Connections

There  is  a  rapidly  growing  interest 
in  problem-scale  parallelism,  both  as  a 
model  of  animal  brains  and  as  a  paradigm 
for  VLSI.  Work  at  Rochester  has 
concentrated  on  connectionist  models  and 
their  application  to  vision.  The 
framework  is  built  around  computational 
modules,  the  simplest  of  which  are  termed 
p-units.  We  have  developed  their 
properties  and  shown  how  they  can  be 
applied to a variety of problems [Feldman
and Ballard, 1980].

To  show  how  the  framework  can  be 
applied  to  computational  problems  in 
vision,  three  specific  examples  have  been 
developed  in  some  detail.  In  the  first, 
spatially  distributed  data  can  be 
associated  with  a  complex  concept.  The 
second  problem  is  related  to  eye 
movements,  namely,  how  the  eyes  can  be 
directed  to  move  to  interesting  spatial 
features  and  how  to  avoid  "reparsing"  the 
images  when  those  movements  occur. 
Finally,  we  have  considered  the  shape  from 
shading  problem  and  shown  how  a  global 
parameter,  such  as  light  source  position, 
interacts  with  the  calculation  of  a 
spatially  distributed  parameter  such  as 
surface  orientation. 

1.3. Shape

A  convenient  representation  for 
blob-like  figures  in  an  image  consists  of 
the  orientation,  length,  and  width  of  a 
bounding rectangle. One fast algorithm
for producing such a bounding rectangle is
based upon a dot product space. The
analysis of the dot product space
representation has been improved to handle
certain pathological cases, and has been
generalized to accommodate different
criteria for the goodness of the
representation [Sloan, 1980].
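
The details of the dot product space are given in [Sloan, 1980] and are not reproduced here. The sketch below shows one plausible reading only, and should not be taken as that algorithm: for each candidate orientation, the points are projected (by dot products) onto a unit vector and its perpendicular, and the orientation whose extents score best under a chosen criterion is kept. The sampling of 180 orientations and the default area criterion are assumptions.

```python
import math


def bounding_rectangle(points, n_angles=180, criterion=lambda a, b: a * b):
    """Return (orientation, length, width) of the best oriented bounding rectangle.

    criterion scores the two extents at each orientation (lower is better);
    the default minimises area, and a perimeter criterion would be
    lambda a, b: a + b.
    """
    best = None
    for k in range(n_angles):
        theta = math.pi * k / n_angles
        ux, uy = math.cos(theta), math.sin(theta)
        along = [x * ux + y * uy for x, y in points]    # dot products with u
        across = [-x * uy + y * ux for x, y in points]  # dot products with u-perp
        a = max(along) - min(along)
        b = max(across) - min(across)
        score = criterion(a, b)
        if best is None or score < best[0]:
            best = (score, theta, a, b)
    _, theta, a, b = best
    if a < b:
        # Report the long side as the length, with orientation in [0, pi).
        theta, a, b = (theta + math.pi / 2) % math.pi, b, a
    return theta, a, b


def _rotated(x, y, angle, cx, cy):
    c, s = math.cos(angle), math.sin(angle)
    return (cx + c * x - s * y, cy + s * x + c * y)


# A 4 x 2 blob outline rotated by 30 degrees about (10, 5), plus its centre.
blob = [_rotated(x, y, math.pi / 6, 10.0, 5.0)
        for x, y in [(-2, -1), (2, -1), (2, 1), (-2, 1), (0, 0)]]
orientation, length, width = bounding_rectangle(blob)
```

Passing a different criterion changes which rectangle is considered best, which mirrors the generalization to different goodness criteria mentioned above.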

This  simple  concept  of  shape  has  been 
applied  to  the  problem  of  reconstructing 
three-dimensional  surfaces  from  very 
sparse data. The key idea is to use
appropriate shape descriptors to
hypothesize a transformation which
accounts for the difference in shape
between successive contours. When the
hypothesized transformation is minor, very
simple-minded surface reconstruction
techniques are sufficient. When there are
major differences in shape or position
between successive contours, our method
hallucinates new contours, using the
hypothesized shape transformation [Sloan
and Hrechanyk, 1981].

1.4. Adaptive Operators

Control is a crucial issue in Image
Understanding.  We  have  been  investigating 
the  role  of  low-level  adaptive  operators 
in  both  the  analysis  of  aerial  images  and 
in  problem  solving.  The  aerial  image  work 
is  reported  on  elsewhere  in  these 
proceedings  [Selfridge  and  Sloan,  1981]. 

In  general,  problem  solvers  cannot 
hope  to  create  plans  that  are  able  to 
fully  specify  all  the  details  of  operation 
beforehand  and  must  depend  on  run-time 
modification  of  the  plan  to  insure  correct 
functioning.  Fortunately,  many  primitive 
actions  are  highly  stereotyped  and  can  be 
performed  by  adapting  pre-programmed 
tactics  to  the  current  goal  context  and 
operating environment. The architecture
and operation of Adaptive Modules
[Russell, 1981] demonstrate this approach
to handling the problem of executing a
low-level plan in a manner which uses
multiple, parallel and distributed sources
of knowledge.

1.5. Medical Applications

A  system  has  been  built  in  which 
Computer  Tomograms  of  the  human  abdomen 
are searched as a 3-D image and matched
against a detailed geometrical model of
the abdomen anatomy. Detected organ
boundaries serve to construct an instance
of  the  model  that  reflects  the  actual 
anatomy  of  a  particular  patient  as 
revealed  by  the  corresponding  image  data. 

The model-directed approach makes
possible the detection of hard-to-find
organs (e.g., kidneys) based on known
locations of easy-to-find organs (e.g.,
spinal column), thus relaxing the problem
of obscured boundaries in noisy data that
tend to hinder data-directed approaches.

The model is hierarchical, built of
Generalized Cylinders, and is inherently
parallel. It captures relational,
structural, and quantitative knowledge
that is represented as both data and
procedures [Shani, 1980].


2.  System  Support 
2.1.  Hardware 

The Grinnell GMR-26 display device is
DMA-interfaced to an Eclipse computer, and
has been invaluable as an output device
for our experiments. An Optronics
Colorscan C-4100 drum scanner is on site
and interfaces to the Vision Eclipse.

Both  Eclipse  computers  are  fully 
configured  and  have  been  running 
effectively  with  our  distributed  system 
software. A VAX 11/780 (purchased with
non-DoD funds) is operating and has been
integrated into the local network. A new,
larger capacity Eclipse has been added to
the gateway configuration, giving greater
capacity and reliability. We are
currently  installing  several  additional 
personal  computers  and  a  laser  printer. 

2.2. Software

We  have  been  working  closely  with 
other  IU  contractors  (particularly  CMU)  to 
develop  a  uniform  communication  facility 
for  use  in  the  testbed. 

Local  image-processing  and  graphics 
software  has  been  transferred  to  the  VAX, 
in  C. 

The external representation for PLITS
style messages has been specified, and is
being implemented for transmission of
self-describing messages. A
Name-Type-Value (NTV) message is
conceptually an unordered set of triples
consisting  of  a  name,  a  data-type,  and  a 
value  of  the  given  data-type.  These 
triples  are  called  slots.  The  external 
format  of  these  messages  is  used  by  the 
communication  subsystems.  We  make  no 
other  assumptions  about  other  information 
which  may  be  required  by  the  communication 
subsystem  (e.g.,  headers  or  checksums), 
nor  do  we  require  that  a  message  fit 
inside a single physical packet.
Logically  we  model  the  external  format  of 
a  message  as  a  variable  length  vector  of 
octets  (eight  bit  bytes)  [Low,  1980]. 
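
The idea of a self-describing message can be made concrete with a small sketch. The octet layout below is entirely hypothetical: the actual external format is the one specified in [Low, 1980], which is not reproduced here. The type tags, field widths, and function names are assumptions chosen only to show a set of name, type, value slots flattened into a variable length vector of octets and recovered without regard to slot order.

```python
import struct

# Hypothetical type tags for illustration only.
TYPE_INT, TYPE_STR, TYPE_REAL = 1, 2, 3


def encode_value(type_tag, value):
    if type_tag == TYPE_INT:
        return struct.pack(">i", value)   # 4-octet signed integer
    if type_tag == TYPE_REAL:
        return struct.pack(">d", value)   # 8-octet floating point
    if type_tag == TYPE_STR:
        return value.encode("ascii")
    raise ValueError("unknown type tag")


def decode_value(type_tag, raw):
    if type_tag == TYPE_INT:
        return struct.unpack(">i", raw)[0]
    if type_tag == TYPE_REAL:
        return struct.unpack(">d", raw)[0]
    if type_tag == TYPE_STR:
        return raw.decode("ascii")
    raise ValueError("unknown type tag")


def encode_message(slots):
    """Flatten {name: (type_tag, value)} into a vector of octets."""
    out = bytearray()
    for name, (type_tag, value) in slots.items():
        name_bytes = name.encode("ascii")
        payload = encode_value(type_tag, value)
        out += struct.pack(">B", len(name_bytes)) + name_bytes
        out += struct.pack(">BH", type_tag, len(payload)) + payload
    return bytes(out)


def decode_message(octets):
    """Recover the slot set; the order of slots in the octet vector is irrelevant."""
    slots = {}
    pos = 0
    while pos < len(octets):
        name_len = octets[pos]
        pos += 1
        name = octets[pos:pos + name_len].decode("ascii")
        pos += name_len
        type_tag, size = struct.unpack_from(">BH", octets, pos)
        pos += 3
        slots[name] = (type_tag, decode_value(type_tag, octets[pos:pos + size]))
        pos += size
    return slots


message = {"image": (TYPE_STR, "aerial-17"),
           "rows": (TYPE_INT, 512),
           "scale": (TYPE_REAL, 0.5)}
wire = encode_message(message)
```

Because every slot carries its own name and type, a receiver can decode a message without prior agreement on which slots it contains, and splitting the octet vector across physical packets is left to the communication subsystem.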